Wall-OSS-0.5: 4B VLA with open training code and zero-shot real-robot evaluation [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Wall-OSS-0.5 is a new 4B VLA release from X Square Robot, built on a 3B VLM backbone with action experts in a Mixture-of-Transformers layout. What caught my eye is that the report evaluates the pretrained checkpoint on real robots before any task-specific fine-tuning, instead of only reporting downstream fine-tuned performance.
The reported numbers: zero-shot on a 17-task real-robot suite, 4 tasks score above 80 task progress, including a held-out deformable task (Rope Tightening, 82). After fine-tuning on a 15-task suite, they report 60.5 average task progress, which is +17.5pp over pi0.5, and +26pp on the 10-task manipulation subset. They also report +21.8pp on embodied grounding while general VL ability stays stable.
The method bits I am trying to sanity check are the gradient bridge and the optimizer claim. They argue that discrete action-token CE is the dominant gradient into the VLM backbone, while flow matching's contribution to backbone updates collapses to roughly 5 percent within a few thousand steps. The Vision-Aligned RVQ tokenizer is supposed to make those action tokens semantically grounded instead of just numerical compression. For continuous actions, they still use flow matching, but supervise in recovered action space rather than velocity space. They also include DMuon, a distributed Muon optimizer, with a pretty aggressive overhead reduction claim.
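For anyone who wants to run the same kind of check, here is a minimal sketch of how the per-loss gradient share into a shared backbone could be measured. This is not the paper's code: the toy model (one shared matrix `W`, a cross-entropy head `U`, and a squared-error head `V` standing in for the flow-matching loss), the function names, and the definition of "share" as a ratio of gradient L2 norms are all my assumptions. The paper may define the contribution differently.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def backbone_grads(W, U, V, x, label, v_target):
    """Gradients of each loss w.r.t. the shared matrix W (toy backbone)."""
    h = W @ x                    # shared feature fed to both heads
    # Cross-entropy head over discrete action tokens:
    # dL/dlogits = softmax(logits) - onehot(label)
    p = softmax(U @ h)
    p[label] -= 1.0
    grad_ce = np.outer(U.T @ p, x)
    # Squared-error head as a stand-in for flow matching:
    # L = 0.5 * ||V h - v_target||^2
    err = V @ h - v_target
    grad_fm = np.outer(V.T @ err, x)
    return grad_ce, grad_fm

def gradient_share(grad_ce, grad_fm):
    """Fraction of total gradient norm contributed by each loss (assumed metric)."""
    n_ce = np.linalg.norm(grad_ce)
    n_fm = np.linalg.norm(grad_fm)
    total = n_ce + n_fm
    return {"ce": n_ce / total, "fm": n_fm / total}

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.3, size=(16, 8))
U = rng.normal(scale=0.3, size=(32, 16))
V = rng.normal(scale=0.3, size=(4, 16))
x = rng.normal(size=8)

g_ce, g_fm = backbone_grads(W, U, V, x, label=3, v_target=np.zeros(4))
shares = gradient_share(g_ce, g_fm)
print(shares)
```

In a real PyTorch training loop the two gradients would more likely come from calling `torch.autograd.grad(loss, backbone_params, retain_graph=True)` once per loss, then logging the share every N steps to see whether the flow-matching fraction decays the way the report describes.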
Code: https://github.com/X-Square-Robot/wall-x. Hugging Face org: https://huggingface.co/x-square-robot. Project page: https://x2robot.com/oss#resources. Paper: https://x2robot.com/api/files/file/wall_oss_05.pdf
The questions I had after reading it: if you have run an analogous gradient-bridge ablation in another VLA, did action-token CE dominate in the same way? For people already using Muon, does the DMuon overhead claim sound plausible? And has anyone seen RVQ-with-vision-alignment clearly beat FAST-style tokenization outside this paper?
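For context on the overhead question, here is a rough sketch of the Newton-Schulz orthogonalization step that the publicly available single-device Muon implementation applies to each 2D gradient (or momentum) matrix. This is plain Muon as I understand it, not DMuon; I have no idea how DMuon shards or schedules this step. The coefficients and the 5-iteration default follow the reference Muon code; the function name and test shapes are mine.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map G to the nearest semi-orthogonal matrix (Muon-style)."""
    a, b, c = 3.4445, -4.7750, 2.0315    # quintic coefficients from reference Muon
    X = G / (np.linalg.norm(G) + eps)    # Frobenius-normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the wide orientation (smaller X @ X.T)
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X                # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 32))             # hypothetical gradient matrix
O = newton_schulz_orthogonalize(G)
sv = np.linalg.svd(O, compute_uv=False)
print(sv.min(), sv.max())                # singular values pushed toward 1, not exactly 1
```

Each iteration costs a few matrix multiplies per weight matrix, which is the extra work on top of a plain momentum update. In a distributed setup the full matrix also has to be available wherever this runs, so any DMuon overhead claim presumably comes down to how that compute and communication are partitioned.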
If anyone is already trying to reproduce this on real hardware, drop notes. The third-party results will matter more than the release numbers.