StreamVLN Deployment on AGIBOT D1 Quadruped Robot - End-to-End Closed-Loop Navigation Practice
Recently, I completed a highly rewarding project: successfully deploying large-model-based Vision-Language Navigation (VLN) for real-world, end-to-end, online closed-loop control on the AGIBOT D1 quadruped robot. The full pipeline, from pure simulation and offline scripts to a robot that can understand natural language, perceive its environment, and navigate autonomously, is now fully operational. This post documents the system architecture, core modules, challenges encountered, and future plans.
1 Project Overview
Built on the ROS1 framework, this project integrates real-time RGB visual input, natural language navigation commands, StreamVLN large-model inference, and optimized PID low-level motion control into a complete “Perception → Decision-Making → Execution” closed-loop system. It enables the quadruped robot to navigate indoors autonomously using only natural language instructions.
Key Achievements:
- Established a full-link closed-loop system from camera/odometry → VLN service → velocity control
- Implemented mapping from discrete high-level actions to continuous velocity commands with refined PID control for smoother motion
- Supported autonomous online navigation driven by natural language instructions
Basic functionality has been validated in indoor corridor scenarios. The next phase will focus on motion smoothness, robustness across multiple scenarios, and quantitative evaluation of task success rates.
2 System Architecture
2.1 Hardware Platform: AGIBOT D1 Quadruped Robot
The entire system runs on the AGIBOT D1 quadruped platform:
- Onboard Main Controller: ASUS NUC
- Perception Sensors: Odin1 Spatial Perception Module (providing RGB images and high-frequency odometry)
2.2 Software Architecture: ROS1 Modular Design
The system adopts a three-layer modular architecture with clean decoupling, facilitating easy replacement of models/controllers in subsequent iterations.
(1) Perception Layer
- odin1Node: Publishes undistorted RGB images on /odin1/image/undistorted
- High-frequency odometry and velocity information on /odin1/odometry
(2) Decision-Making Layer (Core)
- Core node: d1_vln_client, which subscribes to images and odometry and calls the external VLN HTTP service
- Outputs high-level action sequences, which are converted to TwistStamped control commands via PID
(3) Execution Layer
- zslbot_vel_controller: Subscribes to /cmd_vel, performs velocity clipping and safety checks, and drives the robot base
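The velocity clipping in the execution layer can be sketched as a simple symmetric clamp. The limit values below are illustrative assumptions; the post does not state the actual parameters used by zslbot_vel_controller.

```python
from dataclasses import dataclass


@dataclass
class VelLimits:
    """Safety envelope for base commands (values are assumed, not from the post)."""
    v_max: float = 0.4   # max linear speed, m/s
    w_max: float = 0.6   # max angular speed, rad/s


def clip_cmd(v: float, w: float, lim: VelLimits = VelLimits()) -> tuple[float, float]:
    """Clamp a (linear, angular) velocity command into the safe envelope
    before it is forwarded to the robot base."""
    v = max(-lim.v_max, min(lim.v_max, v))
    w = max(-lim.w_max, min(lim.w_max, w))
    return v, w
```

In the real node this clamp would sit between the /cmd_vel subscriber callback and the base driver, so any out-of-range command from the decision layer is bounded before actuation.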
(4) Auxiliary Nodes
- Remote-control takeover: joy_node
- Depth-map support (to be improved): pcd2depth_node
2.3 ROS Message Flow (rqt_graph)
The data flow is very clear:
- odin1 → RGB images + odometry
- Images + odometry → d1_vln_client
- d1_vln_client ↔ VLN Server (HTTP POST)
- d1_vln_client → /cmd_vel
- zslbot_vel_controller → robot execution
2.4 d1_vln_client Multi-Threaded Design
To ensure real-time control performance, the client adopts a dual-thread separation design:
- planning_thread: Calls the VLN model to output discrete actions (0-STOP/1-FORWARD/2-TURN LEFT/3-TURN RIGHT)
- control_thread: Runs at approximately 10Hz, continuously publishing velocity commands using optimized PID
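The planning/control separation above can be sketched with two Python threads sharing an action queue. This is a deliberately simplified stand-in: the fake_plan function replaces the actual VLN HTTP call, the velocity table values are assumptions, and the real control_thread runs a PID loop rather than publishing one command per action.

```python
import queue
import threading
import time

action_q: "queue.Queue[int]" = queue.Queue()
published: list[tuple[float, float]] = []   # stands in for /cmd_vel publishing
done_planning = threading.Event()


def planning_thread() -> None:
    """Stand-in for the VLN call: push a short discrete-action sequence
    (0=STOP, 1=FORWARD, 2=TURN LEFT, 3=TURN RIGHT), then signal completion."""
    for a in (1, 2, 1, 0):
        action_q.put(a)
    done_planning.set()


def control_thread(rate_hz: float = 10.0) -> None:
    """Drain actions and emit velocity targets at roughly rate_hz.
    Velocity values here are illustrative, not the robot's real gains."""
    vel = {0: (0.0, 0.0), 1: (0.2, 0.0), 2: (0.0, 0.4), 3: (0.0, -0.4)}
    while not (done_planning.is_set() and action_q.empty()):
        try:
            a = action_q.get(timeout=0.05)
        except queue.Empty:
            continue
        published.append(vel[a])
        time.sleep(1.0 / rate_hz)
```

Keeping the slow model call and the fast 10 Hz publisher in separate threads is what prevents inference latency from starving the control loop.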
3 Core Technical Modules
3.1 VLN Decision-Making Layer: Server + Client
(1) VLN Server (http_realworld_server_v3.py)
- Built with Flask for HTTP services
- Backbone: StreamVLNForCausalLM (Qwen-1.5, bfloat16, single-card CUDA)
- Triggers inference every 4 steps to reduce latency
- Automatically saves annotated images with actions/timestamps
- Supports predicting the next 4 actions (num_future_steps=4)
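The "infer every 4 steps" gating pairs naturally with num_future_steps=4: one model call yields four actions, which are replayed until the next call. Below is a minimal sketch of that logic with a hypothetical InferenceGate class; the real internals of http_realworld_server_v3.py are not shown in this post.

```python
class InferenceGate:
    """Run the (expensive) model only every `every_n` frames and replay the
    cached future-action sequence in between. Hypothetical helper, not the
    actual server code."""

    def __init__(self, every_n: int = 4, num_future_steps: int = 4):
        self.every_n = every_n
        self.num_future_steps = num_future_steps
        self.step = 0
        self.cached_actions: list[int] = []

    def on_frame(self, run_model) -> int:
        """run_model() is expected to return `num_future_steps` discrete actions."""
        if self.step % self.every_n == 0:
            self.cached_actions = run_model()
        action = self.cached_actions[self.step % self.every_n]
        self.step += 1
        return action
```

With every_n == num_future_steps, each cached sequence is consumed exactly once before the next inference, so the per-frame response latency stays low on the three out of four frames that skip the model.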
(2) VLN Client (d1_client_new.py)
- Sends JPEG images + JSON (commands + reset flags) via POST
- Clears model memory on the first call with reset=True
- Converts actions to 4x4 homogeneous target poses (step size 0.25 m, rotation angle ±15°)
- Delivers target poses to the low-level controller
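The action-to-pose conversion can be sketched as follows, using the step size (0.25 m) and turn angle (±15°) stated above. The function name and the pure-Python 4x4 representation are illustrative; the actual d1_client_new.py implementation is not shown in the post.

```python
import math


def action_to_target_pose(action: int, step: float = 0.25, turn_deg: float = 15.0):
    """Return a 4x4 homogeneous transform (row-major nested lists) for one
    discrete action, expressed in the robot's current body frame:
    0 = STOP (identity), 1 = FORWARD (step m ahead),
    2 = TURN LEFT (+turn_deg), 3 = TURN RIGHT (-turn_deg)."""
    dx, yaw = 0.0, 0.0
    if action == 1:
        dx = step
    elif action == 2:
        yaw = math.radians(turn_deg)
    elif action == 3:
        yaw = -math.radians(turn_deg)
    c, s = math.cos(yaw), math.sin(yaw)
    # Planar motion: rotation about z, translation along the body x-axis.
    return [[c, -s, 0.0, dx],
            [s,  c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]
```

Each pose is relative to the robot at decision time, so the client composes it with the current odometry to get the absolute target handed to the PID controller.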
3.2 Low-Level Motion Control: PID Optimization (pid_controller_v2.py)
Extensive anti-jitter/smoothing optimizations were performed for the 10Hz control frequency.
Core Optimizations
- Exponential smoothing (α = 0.14)
- Dead zones: 4 cm translation, ~4.5° heading
- Rate clipping: Δv = 0.18 m/s, Δω = 0.25 rad/s
- Suppression of tiny negative linear velocities
- Automatic replanning triggered when pose error falls below a threshold
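The shaping stages listed above can be sketched as one post-processing function applied to the raw PID output each 10 Hz cycle. The α, dead-zone, and rate-clip values follow the post; the function name, stage ordering, and the -0.05 m/s negative-velocity cutoff are assumptions, and pid_controller_v2.py may differ in detail.

```python
import math

ALPHA = 0.14                      # exponential-smoothing factor (from the post)
DEAD_TRANS = 0.04                 # translation dead zone, m
DEAD_HEAD = math.radians(4.5)     # heading dead zone, rad
DV_MAX, DW_MAX = 0.18, 0.25      # rate clips per cycle (m/s, rad/s), assumed per-cycle


def shape(v_raw, w_raw, v_prev, w_prev, trans_err, head_err):
    """Smooth and bound one raw PID output before publishing it."""
    # 1) Dead zones: ignore tiny residual errors instead of chasing them.
    if abs(trans_err) < DEAD_TRANS:
        v_raw = 0.0
    if abs(head_err) < DEAD_HEAD:
        w_raw = 0.0
    # 2) Exponential smoothing toward the raw command.
    v = v_prev + ALPHA * (v_raw - v_prev)
    w = w_prev + ALPHA * (w_raw - w_prev)
    # 3) Rate clipping: bound the change per control cycle.
    v = max(v_prev - DV_MAX, min(v_prev + DV_MAX, v))
    w = max(w_prev - DW_MAX, min(w_prev + DW_MAX, w))
    # 4) Suppress tiny negative linear velocity so the base never creeps
    #    backward (the -0.05 threshold is an assumption).
    if -0.05 < v < 0.0:
        v = 0.0
    return v, w
```

The dead zones are what stop the 10 Hz loop from oscillating around the goal, while the smoothing and rate clips trade a little responsiveness for visibly less jitter on the quadruped.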
4 Full Execution Pipeline
- RGB images + odometry → d1_vln_client
- VLN inference → 4-step action sequence
- Update homogeneous target poses based on actions
- PID 10 Hz closed loop → /cmd_vel → robot execution
5 Experimental Results
We tested with a typical continuous instruction:
“Go straight along the office corridor. When you see the paper box, turn left, then go forward and stop in front of the door.”
The system can output a continuous action sequence and finally stop at the specified position.
The real robot trajectory and PID control curve also verify closed-loop stability.
Qualitative Conclusions
- Completed the milestone from offline testing → real-world online closed-loop
- Complete decoupling of high-level semantic decision-making (VLN) and low-level control (PID)
- Issues such as dead zones and inconsistent message types have been localized and partially resolved
6 Current Challenges and Risks
- Initialization Failure: First-frame inference may cause the robot to spin in place in narrow spaces
- Traditional Local Planners Not Integrated: DWA/TEB not yet connected
- Incomplete Depth Map Integration: Obstacle perception relies on RGB, with limited obstacle avoidance
Next Steps
- Model Upgrade
- More efficient StreamVLN variants
- Fallback mechanism for initial planning failure
- Motion Smoothing
- Introduce DWA/TEB to replace/complement pure PID
- System Robustness
- Resolve initialization failure to support multiple indoor/outdoor scenarios
- Perception Enhancement
- Full integration of depth maps for improved obstacle avoidance
- Toolchain
- One-click startup script
- Web UI: Real-time command issuance, status monitoring, log viewing
Summary
This deployment achieved the first real-time, vision-and-language-driven autonomous navigation closed loop on the AGIBOT D1. The modular, decoupled architecture provides a solid foundation for subsequent iterations. The project focus will now shift from “whether the link works” to “stable control, generalizable scenarios, and successful task execution”, paving the way for more complex VLN deployments.