DFM-VLA

Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

Jiayi Chen1,2* Wenxuan Song1* Shuai Chen3,4 Jingbo Wang1 Zhijun Li2† Haoang Li1†
1The Hong Kong University of Science and Technology (Guangzhou) 2Harbin Institute of Technology 3ShanghaiTech University 4Shanghai Institute of Technical Physics, CAS
4.44 CALVIN Avg. Len.
95.7% LIBERO Avg.
70.8% Real-World Avg.

Abstract

Comparison of discrete VLA paradigms

Vision-Language-Action models that use discrete action tokenization are increasingly popular for robotic manipulation, but existing decoding paradigms remain constrained by a shared weakness: once an action token is generated, it is usually fixed and cannot be corrected in later iterations.

DFM-VLA addresses this limitation with discrete flow matching for iterative action refinement. Instead of committing to a final token sequence in one pass, DFM-VLA models a token-level probability velocity field that updates the full action sequence over multiple refinement steps. The paper studies both an action-embedding-guided velocity formulation and an auxiliary velocity-head formulation.

A two-stage decoding pipeline combines iterative refinement with deterministic validation for stable convergence. Across CALVIN, LIBERO, and real-world manipulation, DFM-VLA consistently improves manipulation quality while maintaining strong inference efficiency, reaching a 4.44 average success length on CALVIN and a 95.7% average success rate on LIBERO.

Iterative Refinement

Previously generated action tokens can be revised instead of being permanently fixed after one prediction.

Velocity Field Modeling

DFM-VLA learns token-level transition rates that guide action updates toward cleaner trajectories.

Two-Stage Decoding

Iterative refinement is followed by deterministic validation for more stable final action sequences.

Efficient Inference

Adaptive KV caching provides up to 2.4x latency speedup over autoregressive decoding while preserving performance.

Method

Unified discrete token modeling with flow-based action refinement

DFM-VLA overall architecture

Training schematic of DFM-VLA. The model predicts clean actions from noised action tokens and learns the velocity field during training.

01

Unified Tokenization

DFM-VLA follows a UniVLA-style discrete formulation where language, third-view images, wrist-view images, and actions are all represented as discrete tokens.

02

Velocity-Guided Action Updates

The model predicts a clean target action sequence and constructs a probability velocity field that tells each token how to move toward better actions over time.
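The update above can be sketched as a CTMC Euler step, the standard sampler for discrete flow matching. The sketch below is a minimal illustration assuming a linear mixture probability path; the function name, the rate formula, and all shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dfm_euler_step(tokens, clean_probs, t, dt, rng):
    """One CTMC Euler refinement step (illustrative sketch).

    tokens:      (L,) current action-token ids
    clean_probs: (L, V) model's predicted distribution over clean tokens
    t:           current flow time in [0, 1); dt: step size
    """
    L, V = clean_probs.shape
    new_tokens = tokens.copy()
    for i in range(L):
        # Under a linear mixture path, the replacement rate at position i
        # scales like (1 - p_1(current token)) / (1 - t); Euler-discretize it.
        jump_prob = dt / max(1.0 - t, dt) * (1.0 - clean_probs[i, tokens[i]])
        if rng.random() < jump_prob:
            # Jump: resample from the predicted clean distribution,
            # excluding the token currently in place.
            p = clean_probs[i].copy()
            p[tokens[i]] = 0.0
            p /= p.sum()
            new_tokens[i] = rng.choice(V, p=p)
    return new_tokens
```

Tokens the model is already confident about get a near-zero jump probability, so repeated steps concentrate updates on the uncertain positions.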

03

Embedding-Guided or Head-Based Velocities

The paper studies two constructions: an embedding-guided formulation built from token distances and a separate velocity head that predicts replacement rates directly.
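The embedding-guided construction can be pictured as follows: candidate tokens that sit close to the predicted clean token in embedding space receive higher transition rates. This is a hedged sketch under assumed shapes; the distance kernel, temperature `tau`, and the combination with model confidence are guesses at the general idea, not the paper's exact formulation.

```python
import numpy as np

def embedding_guided_rates(cur_token, clean_probs, emb, tau=1.0):
    """Illustrative embedding-guided transition rates for one position.

    cur_token:   current token id
    clean_probs: (V,) predicted clean-token distribution at this position
    emb:         (V, D) action-token embedding table
    """
    target = clean_probs @ emb                   # expected clean embedding
    dist = np.linalg.norm(emb - target, axis=1)  # distance of each candidate
    weights = np.exp(-dist / tau)                # closer tokens -> higher rate
    rates = weights * clean_probs                # fold in model confidence
    rates[cur_token] = 0.0                       # no self-transition
    total = rates.sum()
    return rates / total if total > 0 else rates
```

The head-based alternative would instead predict these rates directly with a separate output head, trading the geometric prior of token distances for extra learned capacity.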

04

Refinement Then Validation

Inference first performs stochastic iterative refinement with CTMC Euler updates, then switches to a short deterministic validation phase for stable final convergence.
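The two-stage schedule above can be outlined as a short loop: a stochastic refinement phase that resamples uncertain tokens, followed by a deterministic validation phase that commits each position to its argmax. The step counts, the 0.9 confidence threshold, and the function names below are all illustrative assumptions, not values from the paper.

```python
import numpy as np

def two_stage_decode(init_tokens, predict_clean_probs,
                     n_refine=8, n_validate=2, rng=None):
    """Illustrative two-stage decoding: stochastic refinement, then
    deterministic validation."""
    rng = rng or np.random.default_rng()
    tokens = init_tokens.copy()
    # Stage 1: stochastic iterative refinement.
    for _ in range(n_refine):
        probs = predict_clean_probs(tokens)            # (L, V)
        for i in range(len(tokens)):
            # Resample only positions the model is unsure about.
            if probs[i, tokens[i]] < 0.9:
                tokens[i] = rng.choice(probs.shape[1], p=probs[i])
    # Stage 2: deterministic validation -- argmax, no randomness.
    for _ in range(n_validate):
        probs = predict_clean_probs(tokens)
        tokens = probs.argmax(axis=1)
    return tokens
```

Because the final phase is deterministic, repeated runs from the same refined state converge to the same action sequence, which is the stability property the validation phase is meant to provide.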

Inference

One decoding step of DFM-VLA inference

Inference visualization of one refinement step. DFM-VLA updates action tokens iteratively instead of fixing them after a single prediction.

Results

Simulation and real-world evaluations of iterative action refinement

Simulation Experiments

DFM-VLA simulation results on CALVIN

Simulation benchmark figure on CALVIN.

DFM-VLA simulation results on LIBERO

Simulation benchmark figure on LIBERO.

95.7% LIBERO average success with DFM-VLA + Embed

Benchmark Highlights

  • Achieves the best CALVIN average success length of 4.44.
  • Reaches 98.8% on LIBERO-Object and 92.6% on LIBERO-Long.
  • Embedding-guided velocity modeling converges faster and reaches higher success rates than the auxiliary velocity-head variant.
  • Improves over DreamVLA by +3.1 points on the LIBERO average and over FlowVLA by +7.6 points.

Real-World Experiments

DFM-VLA real-world setup and representative tasks

Experimental setup and representative real-world task scenes.

Experimental Setup

  • Bimanual AgileX platform with two 6-DoF robotic arms and parallel grippers.
  • Three RGB cameras: one central elevated view and two wrist-mounted views.
  • 100 trajectories collected for training on each task.
  • 40 evaluation trials per task, reporting success rate and completion speed.
  • Compared against pi_0-FAST, DreamVLA, and the continuous diffusion baseline RDT.

Task Summary

  • Pot Lift: bimanual collaborative lifting requiring precise coordination.
  • Place Veg. to Pot: grasp elongated vegetables and place them into the pot under varied poses.
  • Place Block to Plate: place a block onto a plate whose height varies, testing spatial precision.
  • DFM-VLA + Embed achieves 77.5%, 70.0%, and 65.0% on the three tasks, averaging 70.8%.

The real-world average of 70.8% surpasses RDT by 10.8 points and DreamVLA by 16.6 points, showing that iterative refinement remains highly competitive outside simulation.

DFM-VLA real-world experiment figure

Real-world performance comparison across the three tasks.

Demos

Representative real-world rollouts from the three DFM-VLA tasks in the paper.

Pot Lift and Place Veg. to Pot

The first row combines one Pot Lift rollout with two Place Veg. to Pot rollouts.

Pot Lift

Place Veg. to Pot 1

Place Veg. to Pot 2

Place Block to Plate

Place a block onto a plate with changing height and spatial layout.

Demo 1

Demo 2

Demo 3

Representative real-world rollouts of DFM-VLA on "Pot Lift", "Place Veg. to Pot", and "Place Block to Plate", showing stable task execution across coordinated bimanual manipulation, varied object poses, and changing spatial layouts.

Citation

@misc{chen2026dfmvla,
  title        = {DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching},
  author       = {Jiayi Chen and Wenxuan Song and Shuai Chen and Jingbo Wang and Zhijun Li and Haoang Li},
  year         = {2026},
  note         = {Preprint},
  howpublished = {\url{https://arxiv.org/html/2603.26320v1}}
}