TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models

1Yonsei University, 2Microsoft Research

TL;DR: Data-efficient, high-performance bimanual manipulation without any bimanual pre-training

Abstract

Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms its direct competitor RDT-1B, a monolithic model of comparable size, without requiring any bimanual pretraining. Furthermore, it narrows the gap to the state-of-the-art model π0, which relies on extensive proprietary bimanual data and compute. These results establish our modular composition approach as a data-efficient and scalable path toward high-performance bimanual manipulation that leverages existing public single-arm data.


TwinVLA


Inspired by humans' two-arm coordination for bimanual manipulation, TwinVLA duplicates a VLM backbone pretrained on cross-embodiment single-arm data (Left) to form two arm-specific branches linked via Joint Attention (Right). Shared inputs (ego-centric views, language instructions) are routed via a mixture-of-experts (MoE) to improve computational efficiency. Only the VLM backbone is duplicated, keeping the increase in model size minimal.
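To make the composition concrete, here is a minimal PyTorch sketch of the duplication step under the assumptions above: only the VLM backbone is copied into two arm-specific branches, while the remaining modules stay shared. The attribute names (`backbone`, `vision_encoder`, `action_head`) are hypothetical placeholders for illustration, not the actual TwinVLA implementation.

```python
import copy
import torch.nn as nn

class TwinVLASketch(nn.Module):
    """Illustrative composition of two arm-specific branches from one
    pretrained single-arm VLA. Only the VLM backbone is duplicated; the
    other components are shared, keeping the parameter increase small."""

    def __init__(self, single_arm_vla: nn.Module):
        super().__init__()
        # Two branches, both initialized from the same pretrained weights.
        self.backbone_left = copy.deepcopy(single_arm_vla.backbone)
        self.backbone_right = copy.deepcopy(single_arm_vla.backbone)
        # Shared components are reused by both branches (not duplicated).
        self.vision_encoder = single_arm_vla.vision_encoder
        self.action_head = single_arm_vla.action_head
```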

1. Single Arm VLA


We first design SingleVLA, an efficient and lightweight single-arm VLA policy built for future extension into a bimanual model (TwinVLA). SingleVLA is pre-trained on a large-scale single-arm dataset, and its core design choice is standardizing all actions as absolute end-effector poses with a 6D rotation representation. This standardization ensures consistent control across different types of robots, securing cross-embodiment compatibility and facilitating the eventual extension to bimanual manipulation.
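As a concrete illustration of this action standardization, the NumPy sketch below converts a rotation matrix to the continuous 6D representation of Zhou et al. and back, and packs one arm's action as position + 6D rotation + gripper. The exact dimension ordering and the gripper channel here are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

def rotmat_to_6d(R: np.ndarray) -> np.ndarray:
    """First two columns of a 3x3 rotation matrix, flattened column-wise
    (the continuous 6D rotation representation of Zhou et al., 2019)."""
    return R[:, :2].reshape(-1, order="F")  # [r11, r21, r31, r12, r22, r32]

def rotmat_from_6d(d6: np.ndarray) -> np.ndarray:
    """Recover a valid rotation matrix from a (possibly unnormalized)
    6D vector via Gram-Schmidt orthogonalization."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)

def standardize_action(position_xyz, rotation_matrix, gripper_open):
    """Pack one arm's action as [x, y, z, 6D rotation, gripper] (10-D, assumed layout)."""
    return np.concatenate([position_xyz, rotmat_to_6d(rotation_matrix), [gripper_open]])

# Example: identity orientation at the origin with an open gripper.
action = standardize_action(np.zeros(3), np.eye(3), gripper_open=1.0)  # shape (10,)
```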

2. Joint Attention


Joint Attention integrates inputs from both arms by sharing only the self-attention layers between the two VLM backbones. A specially designed causal attention mask then enables symmetric coordination: each arm can access the shared information and half of the other arm's tokens without violating autoregressive constraints.
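The sketch below builds one plausible version of such a mask in PyTorch, assuming a token layout of [shared | arm A | arm B] in which the two arm streams share positional indices: each arm token sees the shared prefix, its own past, and the other arm's strictly earlier tokens (roughly half of them on average). This is an interpretation for illustration, not the exact mask used in the paper.

```python
import torch

def build_joint_attention_mask(n_shared: int, n_arm: int) -> torch.Tensor:
    """Boolean mask (True = may attend) over tokens laid out as
    [shared | arm_A | arm_B], with both arm streams sharing position indices."""
    total = n_shared + 2 * n_arm
    mask = torch.zeros(total, total, dtype=torch.bool)

    s = slice(0, n_shared)                   # shared vision/language tokens
    a = slice(n_shared, n_shared + n_arm)    # arm A tokens
    b = slice(n_shared + n_arm, total)       # arm B tokens

    causal_shared = torch.tril(torch.ones(n_shared, n_shared)).bool()
    causal_arm = torch.tril(torch.ones(n_arm, n_arm)).bool()
    strictly_past = torch.tril(torch.ones(n_arm, n_arm), diagonal=-1).bool()

    mask[s, s] = causal_shared   # shared prefix attends causally to itself
    mask[a, s] = True            # arm A sees all shared tokens
    mask[b, s] = True            # arm B sees all shared tokens
    mask[a, a] = causal_arm      # arm A sees its own past (incl. current step)
    mask[b, b] = causal_arm      # arm B sees its own past (incl. current step)
    mask[a, b] = strictly_past   # arm A sees arm B's strictly earlier tokens
    mask[b, a] = strictly_past   # arm B sees arm A's strictly earlier tokens
    return mask

# Usage: pass as a boolean attn_mask (True = attend) to
# torch.nn.functional.scaled_dot_product_attention.
```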

3. Mixture of Experts


To handle redundant common inputs efficiently, TwinVLA uses an MoE mechanism that reduces the overall token length. For the remaining duplicated components not covered by the MoE, it averages their outputs, which simulates a single shared layer without merging parameters and significantly reduces VRAM usage. We further stabilize this averaging with a simple trick called attention re-weighting.
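As a small sketch of the output-averaging idea (assuming it is applied token-wise to shared inputs; the MoE routing and attention re-weighting themselves are not shown), the wrapper below averages the outputs of the two duplicated copies of a layer, which behaves like a single shared layer without actually merging their parameters.

```python
import torch
import torch.nn as nn

class AveragedTwinLayer(nn.Module):
    """Wraps the two arm-specific copies of a duplicated layer. For tokens
    shared between both branches (e.g. ego-centric views and language),
    the two copies' outputs are averaged, simulating a single shared layer
    while keeping the two parameter sets separate."""

    def __init__(self, layer_a: nn.Module, layer_b: nn.Module):
        super().__init__()
        self.layer_a = layer_a
        self.layer_b = layer_b

    def forward(self, shared_tokens: torch.Tensor) -> torch.Tensor:
        out_a = self.layer_a(shared_tokens)
        out_b = self.layer_b(shared_tokens)
        return 0.5 * (out_a + out_b)

# Hypothetical usage: average the two branches' copies of an MLP block.
mlp_a = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
mlp_b = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
shared_mlp = AveragedTwinLayer(mlp_a, mlp_b)
tokens = torch.randn(1, 256, 512)   # a batch of shared vision/language tokens
out = shared_mlp(tokens)            # shape (1, 256, 512)
```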


Experiments


We conduct experiments on diverse bimanual manipulation tasks in both simulation and the real world to test TwinVLA's ability to achieve strong performance by leveraging pretrained knowledge from abundant single-arm data. We compare our method against three key baselines: RDT-1B, a monolithic model of comparable size, to demonstrate our superior data efficiency; π₀, a significantly larger state-of-the-art model trained on extensive proprietary bimanual data, to assess how closely we can approach its performance ceiling; and Diffusion Policy (DP), trained from scratch, to highlight the critical advantages of pretraining.

Real World (Anubis Robot)

Tabletop Real World Results

We evaluate TwinVLA and the baselines on a real-world dual-arm robot platform named Anubis, on three complex tabletop manipulation tasks that require careful coordination. As presented above, TwinVLA outperforms its direct competitor, RDT-1B. This result is particularly noteworthy given that TwinVLA was pretrained on a substantially smaller single-arm dataset (0.5M vs. RDT-1B's 1.4M) with far less compute, demonstrating the data efficiency of our approach. The comparatively low performance of DP further underscores the critical importance of pretraining. Among all methods, π₀ ultimately achieves the best overall performance.


Simulation

Tabletop-Sim Results

We further evaluate TwinVLA and baselines in two simulation benchmarks, RoboTwin 2.0 and our custom Tabletop-Sim, across a wide range of dexterous bimanual tasks. The from-scratch model (DP) again shows the worst performance, highlighting the importance of pretraining. TwinVLA consistently outperforms RDT-1B in most scenarios and achieves performance comparable to π₀ by effectively leveraging single-arm data and the modularity of bimanual manipulation. Notably, TwinVLA even surpasses π₀ in the Tabletop-Sim Easy tasks, despite π₀ being trained on an extensive corpus of high-quality bimanual data.

Tabletop-Sim


RoboTwin 2.0


Language following


To test whether fine-tuning degrades the VLM's ability to follow nuanced instructions, we evaluate on the multi-task setting from Tabletop-Sim. In this evaluation, TwinVLA outperforms both RDT-1B and π₀, demonstrating its ability to function effectively as a VLA even without pre-training on bimanual data.

Data efficiency


TwinVLA demonstrates rapid adaptation to new bimanual target tasks despite having no bimanual pre-training. In our experiments, TwinVLA exhibits a steep learning curve, surpassing the performance of RDT-1B with just 50 demonstration episodes.


Tabletop-Sim Language following


Ablation


We conducted a comprehensive ablation study to validate our key design choices. The sequential removal of our core components—Joint Attention, MoE integration, and Attention Re-weighting—each led to a significant degradation in performance. The results particularly highlighted that Joint Attention is the most critical mechanism for bimanual coordination. Furthermore, additional experiments validated the essential role of single-arm pre-training and confirmed that our twin-structure design is inherently more effective for bimanual tasks than a comparable monolithic model. Collectively, these findings provide strong evidence that every component of TwinVLA is a meaningful contributor to its overall performance.

BibTeX

@misc{im2025twinvladataefficientbimanualmanipulation,
      title={TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models}, 
      author={Hokyun Im and Euijin Jeong and Jianlong Fu and Andrey Kolobov and Youngwoon Lee},
      year={2025},
      eprint={2511.05275},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2511.05275}, 
}