Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong
performance on manipulation tasks, including bimanual tasks.
However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual
tasks typically requires substantial additional bimanual data and fine-tuning. To address this
challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained
single-arm VLA into a coordinated bimanual VLA.
Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA
improves both data efficiency and performance by composing pretrained single-arm policies. Across
diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms its direct
competitor RDT-1B, a monolithic model of comparable size, without requiring any bimanual
pretraining.
Furthermore, it narrows the gap to the state-of-the-art model π0, which relies on extensive
proprietary bimanual data and compute. These results establish our modular composition approach as a
data-efficient and scalable path toward high-performance bimanual manipulation that leverages
existing public single-arm data.