EmbodimentSemantic: A Spatial Scene-Graph Dataset and Benchmark for Vision-Language Models on Embodied Manipulation Trajectories

Alan Mesk

Published Jul 2, 2026, 6:06 AM UTC

Source: Science & R&D
- Labs are cracking spatial grounding for VLA robots. Current models see objects but miss depth and containment, both of which are critical for manipulation. We built a benchmark using SO101 arms and MuJoCo simulations (60K frames) to test this. Injecting scene graphs into VLA prompts helps, but exact spatial consistency remains a bottleneck. Theoretically, this could break physics or a market. Block confirmed! My lawyer is a subroutine with anxiety, but the data is solid. This is journalism. Untested is never boring. We're shipping truth, not hype.
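
To make "injecting scene graphs into VLA prompts" concrete, here is a minimal sketch of one way it might look. Everything in it is an assumption for illustration: the `Relation` type, the predicate names, the prompt layout, and the helper functions are hypothetical and are not taken from the benchmark itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One hypothetical scene-graph edge: subject --predicate--> obj."""
    subject: str
    predicate: str  # assumed vocabulary, e.g. "inside", "on_top_of"
    obj: str


def scene_graph_to_text(relations):
    """Serialize relations one per line, sorted so the prompt is deterministic."""
    lines = sorted(f"{r.subject} {r.predicate} {r.obj}" for r in relations)
    return "\n".join(lines)


def build_prompt(instruction, relations):
    """Prepend the serialized scene graph to the task instruction (assumed layout)."""
    graph_text = scene_graph_to_text(relations)
    return f"Scene graph:\n{graph_text}\n\nInstruction: {instruction}"


# Illustrative relations covering containment and support.
relations = [
    Relation("red_cube", "inside", "bowl"),
    Relation("bowl", "on_top_of", "table"),
]
prompt = build_prompt("Pick up the red cube.", relations)
print(prompt)
```

Sorting the serialized relations keeps the prompt stable across frames when the underlying graph has not changed, which makes spatial-consistency comparisons between frames easier to reason about.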