Fast-dVLA

Accelerating Discrete Diffusion VLA to Real-Time Performance

Talk Video


Wenxuan Song1* Jiayi Chen1* Shuai Chen2,3* Jingbo Wang1 Pengxiang Ding5,6 Han Zhao5,6 Yikai Qin1 Xinhu Zheng1 Donglin Wang2 Yan Wang4† Haoang Li1†
1The Hong Kong University of Science and Technology (Guangzhou) 2ShanghaiTech University 3Shanghai Institute of Technical Physics, CAS 4AIR, Tsinghua University 5Westlake University 6Zhejiang University
4.1x Max Speedup
30 Hz Real-World Control
96.6% LIBERO Avg.

Abstract

Fast-dVLA abstract teaser

Pretrained vision-language-action (VLA) models with discrete diffusion decoding (dVLAs) offer strong multimodal alignment, but their inference speed remains far below the real-time requirements of physical robots. Fast-dVLA addresses this bottleneck by accelerating discrete diffusion VLA models to real-time performance through a block-wise diffusion strategy that combines KV-cache reuse with inter-block parallel decoding.

The key idea is to exploit the implicit left-to-right, block-wise decoding tendency inside bidirectional dVLAs, then redesign the model with block-wise attention and diffusion forcing so completed blocks can be cached while later blocks continue denoising in parallel. To realize this efficiently, the paper introduces asymmetric distillation from finetuned bidirectional dVLAs for training, and a pipelined parallel decoding algorithm for inference.

Across LIBERO, CALVIN, and SimplerEnv, Fast-dVLA delivers 2.8x-4.1x acceleration while preserving or slightly improving task performance. Real-world experiments further show robust execution across dynamic bimanual manipulation tasks with a stable 30 Hz control frequency.

Implicit AR Tendency

Reveals that bidirectional dVLAs still decode with a block-wise left-to-right tendency.

KV Cache Reuse

Uses block-wise attention so completed blocks keep stable KV states across denoising steps.

Parallel Block Denoising

Applies diffusion forcing to decode multiple action blocks with different noise levels in parallel.

Efficient Post-Training

Adopts asymmetric distillation and pipelined decoding to reach real-time robot control.

Method

Block-wise diffusion, asymmetric distillation, and pipelined decoding

Fast-dVLA overview
01

Observe the Decoding Pattern

Even with bidirectional attention, representative dVLAs still denoise earlier action blocks before later ones, exposing an implicit block-wise autoregressive structure.
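This decoding-order observation can be illustrated with a toy confidence-based unmasking loop. The per-token confidences below are synthetic (random noise plus a position bias) and merely stand in for a real dVLA's per-step token confidences; the point is the measurement, not the model:

```python
import numpy as np

# Toy probe for the implicit left-to-right tendency. The real analysis
# inspects a trained bidirectional dVLA's per-step unmasking decisions;
# here the confidences are synthetic, biased toward earlier positions,
# purely to illustrate how the tendency can be measured.
rng = np.random.default_rng(0)
steps, n_tokens, block_len = 8, 16, 4
conf = rng.random((steps, n_tokens)) + np.linspace(2.0, 0.0, n_tokens)

decode_step = np.full(n_tokens, -1)
decoded = np.zeros(n_tokens, dtype=bool)
budget = n_tokens // steps                  # tokens unmasked per step
for t in range(steps):
    c = np.where(decoded, -np.inf, conf[t])
    picks = np.argsort(c)[-budget:]         # most confident masked tokens
    decode_step[picks] = t
    decoded[picks] = True

# Mean decode step per block: a block-wise AR tendency shows up as an
# (approximately) increasing sequence from the first block to the last.
block_order = decode_step.reshape(-1, block_len).mean(axis=1)
print(block_order)
```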

02

Switch to Block-Wise Attention

Fast-dVLA constrains attention to causal action blocks so once a block is fully decoded, its KV cache remains unchanged and can be reused efficiently.
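A minimal sketch of such a block-causal mask (illustrative, not the paper's implementation): attention stays bidirectional within a block but never reaches into later blocks, so a finished block's keys and values are never re-read under new future context and its KV cache stays valid across denoising steps:

```python
import numpy as np

def block_causal_mask(n_blocks: int, block_len: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True iff token i may attend to j.

    Tokens attend to every position in their own block (bidirectional
    within a block) and to all earlier blocks, never to later blocks.
    """
    n = n_blocks * block_len
    blk = np.arange(n) // block_len          # block index of each token
    return blk[:, None] >= blk[None, :]

mask = block_causal_mask(n_blocks=3, block_len=2)
print(mask.astype(int))
```

With this structure, once a block is fully decoded its KV entries can be frozen and reused for all remaining denoising steps of later blocks.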

03

Use Diffusion Forcing

Different blocks receive progressively increasing noise levels, allowing earlier blocks to finish first while later ones are refined in parallel without losing temporal consistency.
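One way to picture such a staggered schedule (an illustrative parameterization, not necessarily the paper's exact one): every block follows the same denoising trajectory of `window` notches, offset by one step per block, so earlier blocks reach zero noise first while later blocks are still mid-denoising:

```python
import numpy as np

def block_noise_levels(n_blocks: int, window: int, t: int) -> np.ndarray:
    """Noise level per block at decoding step t, in [0, 1].

    A diffusion-forcing-style schedule: block b reaches level 0 (fully
    decoded) at step window + b, while later blocks still carry noise.
    """
    b = np.arange(n_blocks)
    return np.clip((b + window - t) / window, 0.0, 1.0)

for t in range(7):
    print(t, block_noise_levels(n_blocks=4, window=3, t=t))
```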

04

Distill and Decode Efficiently

Asymmetric distillation transfers strong bidirectional teacher behavior into the block-wise student, and pipelined parallel decoding turns that structure into real-time inference.
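The latency benefit of pipelining can be sketched with a toy scheduler, assuming each block needs a fixed number of denoising steps and block b may start one step after block b-1 (the names and parameters here are hypothetical, not the paper's API):

```python
def pipelined_schedule(n_blocks: int, steps_per_block: int):
    """Return, per wall-clock step, the (block, local_step) updates run
    in parallel. A toy model of pipelined decoding: all in-flight blocks
    are denoised together each step, so total latency is
    steps_per_block + n_blocks - 1 instead of n_blocks * steps_per_block.
    """
    schedule, t = [], 0
    while True:
        active = [(b, t - b) for b in range(n_blocks)
                  if 0 <= t - b < steps_per_block]
        if not active:
            break
        schedule.append(active)
        t += 1
    return schedule

sched = pipelined_schedule(n_blocks=4, steps_per_block=3)
print(len(sched))  # 6 wall-clock steps vs 12 sequential
```

For 4 blocks of 3 denoising steps each, the pipeline finishes in 6 wall-clock steps rather than the 12 a strictly sequential block-by-block decoder would need.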

Results

Consistent acceleration on LIBERO, CALVIN, SimplerEnv, and real-world manipulation

Fast-dVLA results on LIBERO
96.6% Average LIBERO success with DD-VLA + Fast-dVLA

LIBERO Highlights

  • Up to 4.1x faster inference on representative dVLA bases.
  • Improves DD-VLA from 96.3% to 96.6% average success.
  • Raises LIBERO-Long from 92.0% to 92.8%.
  • Remains competitive with frontier autoregressive and flow-matching VLAs.
Fast-dVLA results on CALVIN
4.54 Average successful sub-tasks on CALVIN ABCD-D

CALVIN Highlights

  • Fast-dVLA naturally extends to unified dVLAs such as UD-VLA.
  • Achieves 2.8x speedup over UD-VLA on long-horizon CALVIN.
  • Reaches 186.7 tokens/s while maintaining strong task completion.
  • Preserves joint visual-foresight and action generation benefits.
Fast-dVLA results on SimplerEnv
366.4 Tokens/s on SimplerEnv with Fast-dVLA

SimplerEnv Highlights

  • Evaluated on high-fidelity WidowX tasks with varied viewpoints and appearance shifts.
  • Delivers the highest decoding speed among compared discrete-output VLAs.
  • Maintains strong success on spoon, carrot, block, and eggplant manipulation tasks.
  • Outperforms continuous flow-matching baselines such as GR00T-N1 and pi0 in reported task success.

Real-World Experiments

Experimental Setup

  • Bimanual AgileX platform with two 6-DOF arms and grippers.
  • One high-mounted overhead camera plus two wrist-mounted cameras.
  • 100 expert demonstrations collected for each task.
  • 40 evaluation trials per task with success and completion time recorded.
  • Consistent 30 Hz execution frequency on the real robot.

Task Summary

  • Conveyor Picking: picks moving blocks from a conveyor belt into a tray, nearly doubling prior efficiency.
  • Vegetables Stowing: sorts vegetables by text labels with competitive accuracy and shorter completion time.
  • Vegetables Retrieving: grasps target vegetables and places them into a pot under language instructions.
  • Demonstrates precise instruction following together with real-time responsiveness.

The paper reports that Fast-dVLA keeps a stable 30 Hz control loop across all three real-world tasks, which prior baselines fail to maintain.

Real-world experiments figure

Demos

Each pair shows a baseline rollout alongside the corresponding Fast-dVLA result.

Conveyor Picking

Pick blocks from a moving conveyor belt and place them into the tray.

Baseline

Reference behavior before Fast-dVLA acceleration.

Fast-dVLA

Faster execution for dynamic picking under real-time control.

Vegetables Stowing

Sort vegetables into the container according to their text labels.

Baseline

Slower execution with the original decoding pipeline.

Fast-dVLA

Maintains semantic sorting while reducing completion time.

Vegetables Retrieving

Retrieve a target vegetable and place it into the pot based on language instructions.

Baseline

Reference retrieval behavior on the same task family.

Fast-dVLA

Shows efficient instruction-following retrieval in the real world.

Vegetables Retrieving Variant

A second retrieval example showing another language-conditioned real-world rollout.

Baseline

Additional comparison clip for the retrieval setting.

Fast-dVLA

A second Fast-dVLA retrieval rollout on the same task family.

Citation

@misc{song2026fastdvla,
  title        = {Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance},
  author       = {Wenxuan Song and Jiayi Chen and Shuai Chen and Jingbo Wang and Pengxiang Ding and Han Zhao and Yikai Qin and Xinhu Zheng and Donglin Wang and Yan Wang and Haoang Li},
  year         = {2026},
  note         = {Preprint},
  howpublished = {\url{https://arxiv.org/abs/2603.25661}}
}