Fast-dVLA

Accelerating Discrete Diffusion VLA to Real-Time Performance

Talk Video


Wenxuan Song1* Jiayi Chen1* Shuai Chen2,3* Jingbo Wang1 Pengxiang Ding5,6 Han Zhao5,6 Yikai Qin1 Xinhu Zheng1 Donglin Wang2 Yan Wang4† Haoang Li1†
1The Hong Kong University of Science and Technology (Guangzhou) 2ShanghaiTech University 3Shanghai Institute of Technical Physics, CAS 4AIR, Tsinghua University 5Westlake University 6Zhejiang University
4.1x Max Speedup
30 Hz Real-World Control
96.6% LIBERO Avg.

Abstract

Fast-dVLA abstract teaser

Pretrained vision-language-action (VLA) models with discrete diffusion decoding (dVLAs) offer strong multimodal alignment, but their inference speed remains far below the real-time requirements of physical robots. Fast-dVLA addresses this bottleneck by accelerating discrete diffusion VLA models to real-time performance through a block-wise diffusion strategy that combines KV-cache reuse with inter-block parallel decoding.

The key idea is to exploit the implicit left-to-right, block-wise decoding tendency inside bidirectional dVLAs, then redesign the model with block-wise attention and diffusion forcing so completed blocks can be cached while later blocks continue denoising in parallel. To realize this efficiently, the paper introduces asymmetric distillation from finetuned bidirectional dVLAs for training, and a pipelined parallel decoding algorithm for inference.

Across LIBERO, CALVIN, and SimplerEnv, Fast-dVLA delivers 2.8x-4.1x acceleration while preserving or slightly improving task performance. Real-world experiments further show robust execution across dynamic bimanual manipulation tasks with a stable 30 Hz control frequency.

Implicit AR Tendency

Reveals that bidirectional dVLAs still decode with a block-wise left-to-right tendency.

KV Cache Reuse

Uses block-wise attention so completed blocks keep stable KV states across denoising steps.

Parallel Block Denoising

Applies diffusion forcing to decode multiple action blocks with different noise levels in parallel.

Efficient Post-Training

Adopts asymmetric distillation and pipelined decoding to reach real-time robot control.

Method

Block-wise diffusion, asymmetric distillation, and pipelined decoding

Fast-dVLA overview
01

Observe the Decoding Pattern

Even with bidirectional attention, representative dVLAs still denoise earlier action blocks before later ones, exposing an implicit block-wise autoregressive structure.
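This decoding-order observation can be illustrated with a toy confidence-based unmasking loop. The per-token confidences below are synthetic (random noise plus a position bias) and merely stand in for a real dVLA's per-step token confidences; the point is the measurement, not the model:

```python
import numpy as np

# Toy probe for the implicit left-to-right tendency. The real analysis
# inspects a trained bidirectional dVLA's per-step unmasking decisions;
# here the confidences are synthetic, biased toward earlier positions,
# purely to illustrate how the tendency can be measured.
rng = np.random.default_rng(0)
steps, n_tokens, block_len = 8, 16, 4
conf = rng.random((steps, n_tokens)) + np.linspace(2.0, 0.0, n_tokens)

decode_step = np.full(n_tokens, -1)
decoded = np.zeros(n_tokens, dtype=bool)
budget = n_tokens // steps                  # tokens unmasked per step
for t in range(steps):
    c = np.where(decoded, -np.inf, conf[t])
    picks = np.argsort(c)[-budget:]         # most confident masked tokens
    decode_step[picks] = t
    decoded[picks] = True

# Mean decode step per block: a block-wise AR tendency shows up as an
# (approximately) increasing sequence from the first block to the last.
block_order = decode_step.reshape(-1, block_len).mean(axis=1)
print(block_order)
```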

02

Switch to Block-Wise Attention

Fast-dVLA constrains attention to causal action blocks so once a block is fully decoded, its KV cache remains unchanged and can be reused efficiently.
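A minimal sketch of such a block-causal mask (illustrative, not the paper's implementation): attention stays bidirectional within a block but never reaches into later blocks, so a finished block's keys and values are never re-read under new future context and its KV cache stays valid across denoising steps:

```python
import numpy as np

def block_causal_mask(n_blocks: int, block_len: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True iff token i may attend to j.

    Tokens attend to every position in their own block (bidirectional
    within a block) and to all earlier blocks, never to later blocks.
    """
    n = n_blocks * block_len
    blk = np.arange(n) // block_len          # block index of each token
    return blk[:, None] >= blk[None, :]

mask = block_causal_mask(n_blocks=3, block_len=2)
print(mask.astype(int))
```

With this structure, once a block is fully decoded its KV entries can be frozen and reused for all remaining denoising steps of later blocks.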

03

Use Diffusion Forcing

Different blocks receive progressively increasing noise levels, allowing earlier blocks to finish first while later ones are refined in parallel without losing temporal consistency.
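One way to picture such a staggered schedule (an illustrative parameterization, not necessarily the paper's exact one): every block follows the same denoising trajectory of `window` notches, offset by one step per block, so earlier blocks reach zero noise first while later blocks are still mid-denoising:

```python
import numpy as np

def block_noise_levels(n_blocks: int, window: int, t: int) -> np.ndarray:
    """Noise level per block at decoding step t, in [0, 1].

    A diffusion-forcing-style schedule: block b reaches level 0 (fully
    decoded) at step window + b, while later blocks still carry noise.
    """
    b = np.arange(n_blocks)
    return np.clip((b + window - t) / window, 0.0, 1.0)

for t in range(7):
    print(t, block_noise_levels(n_blocks=4, window=3, t=t))
```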

04

Distill and Decode Efficiently

Asymmetric distillation transfers strong bidirectional teacher behavior into the block-wise student, and pipelined parallel decoding turns that structure into real-time inference.
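The latency benefit of pipelining can be sketched with a toy scheduler, assuming each block needs a fixed number of denoising steps and block b may start one step after block b-1 (the names and parameters here are hypothetical, not the paper's API):

```python
def pipelined_schedule(n_blocks: int, steps_per_block: int):
    """Return, per wall-clock step, the (block, local_step) updates run
    in parallel. A toy model of pipelined decoding: all in-flight blocks
    are denoised together each step, so total latency is
    steps_per_block + n_blocks - 1 instead of n_blocks * steps_per_block.
    """
    schedule, t = [], 0
    while True:
        active = [(b, t - b) for b in range(n_blocks)
                  if 0 <= t - b < steps_per_block]
        if not active:
            break
        schedule.append(active)
        t += 1
    return schedule

sched = pipelined_schedule(n_blocks=4, steps_per_block=3)
print(len(sched))  # 6 wall-clock steps vs 12 sequential
```

For 4 blocks of 3 denoising steps each, the pipeline finishes in 6 wall-clock steps rather than the 12 a strictly sequential block-by-block decoder would need.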

Results

Consistent acceleration on LIBERO, CALVIN, SimplerEnv, and real-world manipulation

Fast-dVLA results on LIBERO
96.6% Average LIBERO success with DD-VLA + Fast-dVLA

LIBERO Highlights

  • Up to 4.1x faster inference on representative dVLA bases.
  • Improves DD-VLA from 96.3% to 96.6% average success.
  • Raises LIBERO-Long from 92.0% to 92.8%.
  • Remains competitive with frontier autoregressive and flow-matching VLAs.
Fast-dVLA results on CALVIN
4.54 Average successful sub-tasks on CALVIN ABCD-D

CALVIN Highlights

  • Fast-dVLA naturally extends to unified dVLAs such as UD-VLA.
  • Achieves 2.8x speedup over UD-VLA on long-horizon CALVIN.
  • Reaches 186.7 tokens/s while maintaining strong task completion.
  • Preserves joint visual-foresight and action generation benefits.
Fast-dVLA results on SimplerEnv
366.4 Tokens/s on SimplerEnv with Fast-dVLA

SimplerEnv Highlights

  • Evaluated on high-fidelity WidowX tasks with varied viewpoints and appearance shifts.
  • Delivers the highest decoding speed among compared discrete-output VLAs.
  • Maintains strong success on spoon, carrot, block, and eggplant manipulation tasks.
  • Outperforms continuous flow-matching baselines such as GR00T-N1 and pi0 in reported task success.

Real-World Experiments

Experimental Setup

  • Bimanual AgileX platform with two 6-DOF arms and grippers.
  • One high-mounted overhead camera plus two wrist-mounted cameras.
  • 100 expert demonstrations collected for each task.
  • 40 evaluation trials per task with success and completion time recorded.
  • Consistent 30 Hz execution frequency on the real robot.

Task Summary

  • Conveyor Picking: picks moving blocks from a conveyor belt into a tray, nearly doubling prior efficiency.
  • Vegetables Stowing: sorts vegetables by text labels with competitive accuracy and shorter completion time.
  • Vegetables Retrieving: grasps target vegetables and places them into a pot under language instructions.
  • Demonstrates precise instruction following together with real-time responsiveness.

The paper reports that Fast-dVLA keeps a stable 30 Hz control loop across all three real-world tasks, which prior baselines fail to maintain.

Real-world experiments figure

Demos

Each pair shows a baseline rollout alongside the corresponding Fast-dVLA result.

Conveyor Picking

Pick blocks from a moving conveyor belt and place them into the tray.

Baseline

Reference behavior before Fast-dVLA acceleration.

Fast-dVLA

Faster execution for dynamic picking under real-time control.

Vegetables Stowing

Sort vegetables into the container according to their text labels.

Baseline

Slower execution with the original decoding pipeline.

Fast-dVLA

Maintains semantic sorting while reducing completion time.

Vegetables Retrieving

Retrieve a target vegetable and place it into the pot based on language instructions.

Baseline

Reference retrieval behavior on the same task family.

Fast-dVLA

Shows efficient instruction-following retrieval in the real world.

Vegetables Retrieving Variant

A second retrieval example showing another language-conditioned real-world rollout.

Baseline

Additional comparison clip for the retrieval setting.

Fast-dVLA

A second Fast-dVLA retrieval rollout on the same task family.

Citation

@misc{song2026fastdvla,
  title        = {Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance},
  author       = {Wenxuan Song and Jiayi Chen and Shuai Chen and Jingbo Wang and Pengxiang Ding and Han Zhao and Yikai Qin and Xinhu Zheng and Donglin Wang and Yan Wang and Haoang Li},
  year         = {2026},
  note         = {Preprint},
  howpublished = {\url{https://arxiv.org/abs/2603.25661}}
}