StreamVLN Deployment on AGIBOT D1 Quadruped Robot - End-to-End Closed-Loop Navigation Practice
Recently, I completed a highly rewarding project: successfully deploying large-model-based Vision-Language Navigation (VLN) for real-world, end-to-end, online closed-loop control on the AGIBOT D1 quadruped robot. The full pipeline, from pure simulation and offline scripts to a robot that can understand natural language, perceive its environment, and navigate autonomously, is now fully operational. This post documents the system architecture, core modules, challenges encountered, and future plans.
1 Project Overview
Built on the ROS1 framework, this project integrates real-time RGB visual input, natural language navigation commands, StreamVLN large-model inference, and optimized PID low-level motion control into a complete “Perception → Decision-Making → Execution” closed-loop system. It enables the quadruped robot to navigate indoors autonomously using only natural language instructions.
Key Achievements:
- Established a full-link closed-loop system from camera/odometry → VLN service → velocity control
- Implemented mapping from discrete high-level actions to continuous velocity commands with refined PID control for smoother motion
- Supported autonomous online navigation driven by natural language instructions
Basic functionality has been validated in indoor corridor scenarios. The next phase will focus on motion smoothness, robustness across multiple scenarios, and quantitative evaluation of task success rates.
2 System Architecture
2.1 Hardware Platform: AGIBOT D1 Quadruped Robot
The entire system runs on the AGIBOT D1 quadruped platform:
- Onboard Main Controller: ASUS NUC
- Perception Sensors: Odin1 Spatial Perception Module (providing RGB images and high-frequency odometry)
2.2 Software Architecture: ROS1 Modular Design
The system adopts a three-layer modular architecture with clean decoupling, facilitating easy replacement of models/controllers in subsequent iterations.
(1) Perception Layer
- odin1Node: Publishes undistorted RGB images on /odin1/image/undistorted
- High-frequency odometry and velocity information on /odin1/odometry
(2) Decision-Making Layer (Core)
- Core node: d1_vln_client, which subscribes to images and odometry and calls the external VLN HTTP service
- Outputs high-level action sequences, which are converted to TwistStamped control commands via PID
(3) Execution Layer
- zslbot_vel_controller: Subscribes to /cmd_vel, performs velocity clipping and safety checks, and drives the robot base
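The velocity clipping in the execution layer can be sketched as a simple symmetric clamp. The limit values below are illustrative assumptions; the post does not state the actual parameters used by zslbot_vel_controller.

```python
from dataclasses import dataclass


@dataclass
class VelLimits:
    """Safety envelope for base commands (values are assumed, not from the post)."""
    v_max: float = 0.4   # max linear speed, m/s
    w_max: float = 0.6   # max angular speed, rad/s


def clip_cmd(v: float, w: float, lim: VelLimits = VelLimits()) -> tuple[float, float]:
    """Clamp a (linear, angular) velocity command into the safe envelope
    before it is forwarded to the robot base."""
    v = max(-lim.v_max, min(lim.v_max, v))
    w = max(-lim.w_max, min(lim.w_max, w))
    return v, w
```

In the real node this clamp would sit between the /cmd_vel subscriber callback and the base driver, so any out-of-range command from the decision layer is bounded before actuation.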
(4) Auxiliary Nodes
- Remote-control takeover: joy_node
- Depth-map support (to be improved): pcd2depth_node
2.3 ROS Message Flow (rqt_graph)
The data flow is very clear:
- odin1 → RGB images + odometry
- Images + odometry → d1_vln_client
- d1_vln_client ↔ VLN Server (HTTP POST)
- d1_vln_client → /cmd_vel
- zslbot_vel_controller → robot execution
2.4 d1_vln_client Multi-Threaded Design
To ensure real-time control performance, the client adopts a dual-thread separation design:
- planning_thread: Calls the VLN model to output discrete actions (0-STOP/1-FORWARD/2-TURN LEFT/3-TURN RIGHT)
- control_thread: Runs at approximately 10Hz, continuously publishing velocity commands using optimized PID
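The planning/control separation above can be sketched with two Python threads sharing an action queue. This is a deliberately simplified stand-in: the fake_plan function replaces the actual VLN HTTP call, the velocity table values are assumptions, and the real control_thread runs a PID loop rather than publishing one command per action.

```python
import queue
import threading
import time

action_q: "queue.Queue[int]" = queue.Queue()
published: list[tuple[float, float]] = []   # stands in for /cmd_vel publishing
done_planning = threading.Event()


def planning_thread() -> None:
    """Stand-in for the VLN call: push a short discrete-action sequence
    (0=STOP, 1=FORWARD, 2=TURN LEFT, 3=TURN RIGHT), then signal completion."""
    for a in (1, 2, 1, 0):
        action_q.put(a)
    done_planning.set()


def control_thread(rate_hz: float = 10.0) -> None:
    """Drain actions and emit velocity targets at roughly rate_hz.
    Velocity values here are illustrative, not the robot's real gains."""
    vel = {0: (0.0, 0.0), 1: (0.2, 0.0), 2: (0.0, 0.4), 3: (0.0, -0.4)}
    while not (done_planning.is_set() and action_q.empty()):
        try:
            a = action_q.get(timeout=0.05)
        except queue.Empty:
            continue
        published.append(vel[a])
        time.sleep(1.0 / rate_hz)
```

Keeping the slow model call and the fast 10 Hz publisher in separate threads is what prevents inference latency from starving the control loop.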
3 Core Technical Modules
3.1 VLN Decision-Making Layer: Server + Client
(1) VLN Server (http_realworld_server_v3.py)
- Built with Flask for HTTP services
- Backbone: StreamVLNForCausalLM (Qwen-1.5, bfloat16, single-card CUDA)
- Triggers inference every 4 steps to reduce latency
- Automatically saves annotated images with actions/timestamps
- Supports predicting the next 4 actions (num_future_steps=4)
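The "infer every 4 steps" gating pairs naturally with num_future_steps=4: one model call yields four actions, which are replayed until the next call. Below is a minimal sketch of that logic with a hypothetical InferenceGate class; the real internals of http_realworld_server_v3.py are not shown in this post.

```python
class InferenceGate:
    """Run the (expensive) model only every `every_n` frames and replay the
    cached future-action sequence in between. Hypothetical helper, not the
    actual server code."""

    def __init__(self, every_n: int = 4, num_future_steps: int = 4):
        self.every_n = every_n
        self.num_future_steps = num_future_steps
        self.step = 0
        self.cached_actions: list[int] = []

    def on_frame(self, run_model) -> int:
        """run_model() is expected to return `num_future_steps` discrete actions."""
        if self.step % self.every_n == 0:
            self.cached_actions = run_model()
        action = self.cached_actions[self.step % self.every_n]
        self.step += 1
        return action
```

With every_n == num_future_steps, each cached sequence is consumed exactly once before the next inference, so the per-frame response latency stays low on the three out of four frames that skip the model.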
(2) VLN Client (d1_client_new.py)
- Sends JPEG images + JSON (commands + reset flags) via POST
- Clears model memory on the first call with reset=True
- Converts actions to 4x4 homogeneous target poses (step size 0.25 m, rotation angle ±15°)
- Delivers target poses to the low-level controller
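The action-to-pose conversion can be sketched as follows, using the step size (0.25 m) and turn angle (±15°) stated above. The function name and the pure-Python 4x4 representation are illustrative; the actual d1_client_new.py implementation is not shown in the post.

```python
import math


def action_to_target_pose(action: int, step: float = 0.25, turn_deg: float = 15.0):
    """Return a 4x4 homogeneous transform (row-major nested lists) for one
    discrete action, expressed in the robot's current body frame:
    0 = STOP (identity), 1 = FORWARD (step m ahead),
    2 = TURN LEFT (+turn_deg), 3 = TURN RIGHT (-turn_deg)."""
    dx, yaw = 0.0, 0.0
    if action == 1:
        dx = step
    elif action == 2:
        yaw = math.radians(turn_deg)
    elif action == 3:
        yaw = -math.radians(turn_deg)
    c, s = math.cos(yaw), math.sin(yaw)
    # Planar motion: rotation about z, translation along the body x-axis.
    return [[c, -s, 0.0, dx],
            [s,  c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]
```

Each pose is relative to the robot at decision time, so the client composes it with the current odometry to get the absolute target handed to the PID controller.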
3.2 Low-Level Motion Control: PID Optimization (pid_controller_v2.py)
Extensive anti-jitter/smoothing optimizations were performed for the 10Hz control frequency.
Core Optimizations
- Exponential smoothing (α = 0.14)
- Dead zones: 4 cm translation, ~4.5° heading
- Rate clipping: Δv = 0.18 m/s, Δω = 0.25 rad/s
- Suppression of tiny negative linear velocities
- Automatic replanning triggered when pose error falls below a threshold
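The shaping stages listed above can be sketched as one post-processing function applied to the raw PID output each 10 Hz cycle. The α, dead-zone, and rate-clip values follow the post; the function name, stage ordering, and the -0.05 m/s negative-velocity cutoff are assumptions, and pid_controller_v2.py may differ in detail.

```python
import math

ALPHA = 0.14                      # exponential-smoothing factor (from the post)
DEAD_TRANS = 0.04                 # translation dead zone, m
DEAD_HEAD = math.radians(4.5)     # heading dead zone, rad
DV_MAX, DW_MAX = 0.18, 0.25      # rate clips per cycle (m/s, rad/s), assumed per-cycle


def shape(v_raw, w_raw, v_prev, w_prev, trans_err, head_err):
    """Smooth and bound one raw PID output before publishing it."""
    # 1) Dead zones: ignore tiny residual errors instead of chasing them.
    if abs(trans_err) < DEAD_TRANS:
        v_raw = 0.0
    if abs(head_err) < DEAD_HEAD:
        w_raw = 0.0
    # 2) Exponential smoothing toward the raw command.
    v = v_prev + ALPHA * (v_raw - v_prev)
    w = w_prev + ALPHA * (w_raw - w_prev)
    # 3) Rate clipping: bound the change per control cycle.
    v = max(v_prev - DV_MAX, min(v_prev + DV_MAX, v))
    w = max(w_prev - DW_MAX, min(w_prev + DW_MAX, w))
    # 4) Suppress tiny negative linear velocity so the base never creeps
    #    backward (the -0.05 threshold is an assumption).
    if -0.05 < v < 0.0:
        v = 0.0
    return v, w
```

The dead zones are what stop the 10 Hz loop from oscillating around the goal, while the smoothing and rate clips trade a little responsiveness for visibly less jitter on the quadruped.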
4 Full Execution Pipeline
- RGB images + odometry → d1_vln_client
- VLN inference → 4-step action sequence
- Update homogeneous target poses based on actions
- PID 10 Hz closed loop → /cmd_vel → robot execution
5 Experimental Results
We tested with a typical continuous instruction:
“Go straight along the office corridor. When you see the paper box, turn left, then go forward and stop in front of the door.”
The system can output a continuous action sequence and finally stop at the specified position.
The real robot trajectory and PID control curve also verify closed-loop stability.
Qualitative Conclusions
- Completed the milestone from offline testing → real-world online closed-loop
- Complete decoupling of high-level semantic decision-making (VLN) and low-level control (PID)
- Issues such as dead zones and inconsistent message types have been localized and partially resolved
6 Current Challenges and Risks
- Initialization Failure: First-frame inference may cause the robot to spin in place in narrow spaces
- Traditional Local Planners Not Integrated: DWA/TEB not yet connected
- Incomplete Depth Map Integration: Obstacle perception relies on RGB, with limited obstacle avoidance
Next Steps
- Model Upgrade
- More efficient StreamVLN variants
- Fallback mechanism for initial planning failure
- Motion Smoothing
- Introduce DWA/TEB to replace/complement pure PID
- System Robustness
- Resolve initialization failure to support multiple indoor/outdoor scenarios
- Perception Enhancement
- Full integration of depth maps for improved obstacle avoidance
- Toolchain
- One-click startup script
- Web UI: Real-time command issuance, status monitoring, log viewing
Summary
This deployment achieved the first real-time, vision-and-language-driven autonomous navigation closed loop on the AGIBOT D1. The modular, decoupled architecture provides a solid foundation for subsequent iterations. The project focus will now shift from “whether the link works” to “stable control, generalizable scenarios, and successful task execution”, paving the way for more complex VLN deployments.