ZhiYuan Robotics Releases and Open-Sources the World's First World Model Driven by Robot Action Sequences

Author | Xu Xuanju

Recently, ZhiYuan Robotics (AgiBot) announced two releases in the field of embodied intelligence: EVAC (EnerVerse-AC), which it describes as the world’s first embodied world model driven by robot action sequences, and EWMBench, an evaluation benchmark for embodied world models. Both are now fully open-sourced. The aim is to establish a development paradigm of “low-cost simulation – standardized evaluation – efficient iteration” that supports global embodied intelligence research and accelerates deployment and industrial adoption.

EVAC arxiv: https://arxiv.org/abs/2505.09723

EVAC open-source code: https://github.com/AgibotTech/EnerVerse-AC

EWMBench arxiv: https://arxiv.org/abs/2505.09694

EWMBench open-source code: https://github.com/AgibotTech/EWMBench

The development of embodied intelligence currently faces two key constraints. In testing, real-robot validation is costly and risky, while simulation systems are limited by the gap between simulation and reality. On the data side, there is not yet an efficient way to reuse the large volume of existing real-robot data through trajectory augmentation, which limits data diversity and generalization training. To address both problems, ZhiYuan Robotics has built on EnerVerse, the world model architecture it released last year, and launched the action-sequence-driven world model EVAC and the embodied world model evaluation benchmark EWMBench. Together they form a closed loop from training to evaluation for research on embodied world models.

The world’s first robot action sequence driven world model

EVAC is a world model capable of dynamically reproducing complex interactions between robots and their environments, marking a leap from traditional simulation to generative simulation.

Core capabilities: Precise mapping from “physical execution” to “pixel space”

EVAC builds on the earlier EnerVerse architecture and introduces a multi-level action-condition injection mechanism, enabling end-to-end generation from physical actions to visual dynamics. Its core capabilities are as follows:

High-precision alignment of robot actions and pixels: EVAC projects the 6D pose (x, y, z, roll, pitch, yaw) of the robotic arm and the end effector’s stroke into an action map, giving pixel-level alignment between physical actions and image frames. This lets it model complex dynamic behaviors such as “grasping,” “placing,” “colliding,” “pushing and pulling,” “fast throwing,” and “slow shaking”;
Dynamic multi-view modeling: A Ray Map encodes camera motion trajectories, supporting consistent and coherent scene generation from multiple viewpoints such as the head and wrist cameras, so the model can generate a more complete picture of the environment;
Long-term temporal consistency: Using a chunk-wise autoregressive diffusion architecture and a sparse memory mechanism, EVAC can stably generate up to 30 consecutive segments from a single view, and 10 consecutive segments from multiple views without drift, keeping the simulation coherent and realistic over time;
Efficient data utilization: Training combines the AgiBot World dataset with failure trajectories (such as grasp slips and path collisions). This improves generation quality and suppresses hallucinations, so the model captures the dynamic interactions between robots and their environments more plausibly and completely.
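
To make the idea of mapping a 6D pose into pixel space concrete, here is a minimal sketch using a standard pinhole camera model. It is an illustration under stated assumptions, not EVAC's actual action-map renderer: the function names, the ZYX Euler convention, the assumption that the pose is already expressed in the camera frame, and the choice of projecting the end-effector origin plus three axis tips are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float]


def rpy_to_matrix(roll: float, pitch: float, yaw: float) -> List[List[float]]:
    """Rotation matrix for ZYX Euler angles: R = Rz(yaw) * Ry(pitch) * Rx(roll)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def project(point: Sequence[float], fx: float, fy: float,
            cx: float, cy: float) -> Pixel:
    """Pinhole projection of a 3D point given in the camera frame."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def pose_to_keypoints(pose: Sequence[float], fx: float, fy: float,
                      cx: float, cy: float,
                      axis_len: float = 0.1) -> List[Pixel]:
    """Project the end-effector origin and its three axis tips to pixels.

    Assumption: pose = (x, y, z, roll, pitch, yaw) is already expressed in
    the camera frame. A real pipeline would first apply the camera extrinsics.
    """
    x, y, z, roll, pitch, yaw = pose
    rot = rpy_to_matrix(roll, pitch, yaw)
    origin = (x, y, z)
    points = [list(origin)]
    for col in range(3):
        # Column `col` of the rotation matrix is the direction of that local axis.
        points.append([origin[r] + axis_len * rot[r][col] for r in range(3)])
    return [project(p, fx, fy, cx, cy) for p in points]
```

Drawing these keypoints onto a blank image for every frame would give a per-frame conditioning map aligned with the camera view, which is the general idea behind pixel-aligned action conditioning.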

Generative simulation evaluation + data engine dual drive

Generative simulation evaluation

To address the cost, risk, and reproducibility problems of real-robot evaluation, EVAC introduces a generative simulation evaluation scheme in which the world model and the policy model run inference alternately, forming a complete interactive evaluation pipeline. Experiments show that the success rates measured inside EVAC are highly consistent with real-robot success rates across multiple tasks, and that the evaluation can reliably identify the better-performing model weights, making policy model selection considerably more efficient.
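
The alternating evaluation loop described above can be sketched as follows. This is a toy illustration, not EVAC's API: `rollout`, the callable signatures, and the 1D stand-ins for the policy under evaluation and the world model are all hypothetical, chosen only to show the control flow of policy step, world-model step, and success check.

```python
from typing import Callable, List, Tuple

Frame = List[float]    # stand-in for an image observation
Action = List[float]   # stand-in for one robot action


def rollout(
    policy: Callable[[Frame], List[Action]],
    world_model: Callable[[List[Frame], List[Action]], List[Frame]],
    first_frame: Frame,
    num_chunks: int,
    is_success: Callable[[Frame], bool],
) -> Tuple[bool, List[Frame]]:
    """Alternate policy inference and world-model generation, chunk by chunk."""
    frames = [first_frame]
    for _ in range(num_chunks):
        actions = policy(frames[-1])                 # policy sees the latest frame
        frames.extend(world_model(frames, actions))  # model "renders" the outcome
        if is_success(frames[-1]):
            return True, frames
    return False, frames


# Toy stand-ins: the "frame" is a 1D position and the goal is to reach 1.0.
def toy_policy(frame: Frame) -> List[Action]:
    return [[0.25], [0.25]]  # one chunk of two small steps


def toy_world_model(history: List[Frame], actions: List[Action]) -> List[Frame]:
    pos = history[-1][0]
    out = []
    for action in actions:
        pos += action[0]
        out.append([pos])
    return out


success, frames = rollout(toy_policy, toy_world_model, [0.0], 5,
                          lambda f: f[0] >= 1.0)
```

Repeating such rollouts over many initial frames and averaging the success flag would give a simulated success rate that can be compared against real-robot results.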

Data augmentation engine

Using action interpolation and high-fidelity image generation, EVAC can augment data at scale from a small amount of expert trajectory data. Policy models trained with EVAC-augmented data achieve task success rate improvements of up to 29% and track targets noticeably better, supporting the practicality and cost-effectiveness of this approach for embodied intelligence research.
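
As a rough illustration of action interpolation, the sketch below resamples an expert trajectory to a new length with piecewise-linear interpolation. The article does not specify EVAC's interpolation scheme, so the function names and the linear scheme are assumptions; in particular, interpolating roll, pitch, and yaw linearly is a simplification that ignores angle wrap-around.

```python
from typing import List, Sequence

Waypoint = List[float]


def lerp(a: Sequence[float], b: Sequence[float], t: float) -> Waypoint:
    """Linear interpolation between two waypoints, with t in [0, 1]."""
    return [x + (y - x) * t for x, y in zip(a, b)]


def resample(traj: Sequence[Sequence[float]], num_points: int) -> List[Waypoint]:
    """Resample a trajectory to num_points waypoints, evenly spaced in index.

    Assumption: waypoints are uniformly spaced in time, so interpolating
    along the index is equivalent to interpolating along time.
    """
    if len(traj) < 2 or num_points < 2:
        raise ValueError("need at least two waypoints and two output points")
    last = len(traj) - 1
    out = []
    for i in range(num_points):
        s = i * last / (num_points - 1)   # fractional index into traj
        lo = min(int(s), last - 1)        # segment start, clamped at the end
        out.append(lerp(traj[lo], traj[lo + 1], s - lo))
    return out


expert = [[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]]
dense = resample(expert, 5)
```

Each resampled action sequence could then be fed to the world model to generate matching video frames, yielding new training pairs from a single expert demonstration.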

Creating a “quality inspection standard” for embodied world models

To measure the performance of embodied world models scientifically and systematically, ZhiYuan Robotics has launched EWMBench, which it describes as the world’s first evaluation benchmark for embodied world models, aiming to fill a gap in the industry and establish unified, credible evaluation standards.

Three-dimensional evaluation system: Scene × Motion × Semantics

Because robot manipulation scenarios are complex and domain-specific, EWMBench evaluates models along three core dimensions: scene consistency, motion correctness, and semantic alignment & diversity:

Scene Consistency: Evaluating the stability and authenticity of backgrounds/objects/perspectives in generated scenes, quantified using fine-tuned DINOv2 features.
Motion Correctness: Utilizing three complementary metrics—HSD (Symmetric Hausdorff Distance), nDTW (normalized Dynamic Time Warping), and Dynamics Score—to accurately assess the reasonableness and dynamic authenticity of generated actions.
Semantic Alignment & Diversity: Combining an MLLM (multimodal large language model) and CLIP to evaluate the semantics of generated videos at multiple levels, including global instruction alignment, key-step semantic accuracy, and logical coherence.
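
For readers unfamiliar with the trajectory metrics named above, here is a minimal sketch of a symmetric Hausdorff distance and a normalized DTW score between a generated and a reference trajectory. The normalization shown, exp(-DTW / (reference length × threshold)), is one common definition of nDTW and is an assumption here; EWMBench's exact formulation and the `threshold` parameter are not given in this article.

```python
import math
from typing import Sequence

Point = Sequence[float]


def directed_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def symmetric_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def dtw(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Dynamic time warping cost: summed distance along the best alignment."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def ndtw(generated: Sequence[Point], reference: Sequence[Point],
         threshold: float = 1.0) -> float:
    """Normalized DTW in (0, 1]; 1.0 means the trajectories match exactly.

    Assumption: `threshold` is a hypothetical distance scale, not a value
    taken from EWMBench.
    """
    return math.exp(-dtw(generated, reference) / (len(reference) * threshold))


ref = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gen = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
hsd_value = symmetric_hausdorff(gen, ref)
ndtw_value = ndtw(gen, ref)
```

The two metrics are complementary: the Hausdorff distance captures the worst-case spatial deviation regardless of timing, while DTW rewards trajectories that follow the reference in the right order even if their speed differs.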

Authoritative data support and convenient open-source tools

Benchmark dataset: EWMBench is built on AgiBot World, an open-source million-scale real-robot dataset. It covers 10 typical robot manipulation tasks across three major scenarios (home, industrial, and medical) and a range of interactive objects, including rigid, soft, fluid, and articulated objects. The benchmark contains over 300 carefully designed test samples, 30% of which are challenging scenarios (low light or partial occlusion), to validate model robustness in complex environments.
Open-source evaluation tools: ZhiYuan Robotics has also open-sourced a full-process evaluation tool that generates standardized comparison reports with one click. This lowers the barrier to evaluation, lets researchers compare models and analyze performance quickly, and speeds up experimental validation and reproduction of results.

Outstanding evaluation performance: Closer to human subjective perception

Compared with VBench, the current mainstream benchmark for video generation, EWMBench agrees more closely with human subjective judgments and reflects the actual capabilities of embodied world models more faithfully and in finer detail along core dimensions such as interaction understanding, action reproduction, and visual consistency.

EnerVerse, as a powerful world model infrastructure, provides a reliable foundational framework and pre-training capability for EVAC, while the diverse high-quality data generated by EVAC can further optimize the EnerVerse model, forming a “training – validation” technical closed loop that continuously drives model performance breakthroughs. Through the refined, multi-dimensional quantitative analysis provided by EWMBench, the R&D team can accurately identify potential shortcomings of EVAC in handling complex scenarios such as “multi-object interaction” and “dynamic environment obstacle avoidance,” allowing for more targeted optimizations.

It is reported that the combination of EVAC and EWMBench has officially been selected as the baseline system and evaluation standard for the AgiBot World Challenge @ IROS 2025 – World Model track.
