AI Agents Require World Models

“Google DeepMind’s research published at ICML this year, titled ‘General Agents Contain World Models’, proves a key conclusion: all agents capable of successfully handling complex, multi-step tasks must possess some form of a ‘world model’, which is an accurate understanding of the operational rules of their environment. The study further finds that the stronger the agent’s performance, the more precise its world model must be.”

AI agents differ from large language models in that they must make decisions continuously to reach their ultimate goals. To decide, they need a clear understanding of what will happen after they take an action; otherwise they either delay the decision or choose at random.

Companies including Google (mentioned above) and Meta have recently proposed a new idea for agent training: using early experience to ‘mid-train’ agents.

Note: In October 2025, the paper ‘Agent Learning via Early Experience’, published by Meta and other institutions, proposed a third approach: the ‘mid-training’ paradigm.

Returning to world models: a world model is the agent’s internal representation of the operational rules of its environment, which lets it ‘imagine’ or predict the future. Given the current state and an intended action, it can estimate what the next state will be and what results will follow, and then plan and decide accordingly.

In Google’s paper, the environment is characterized as a (controllable) Markov process whose core is the transition function from a state and action to the next state, P(s′|s,a). The ‘world model’ is explicitly defined as a computable approximation of this transition function, and the research provides algorithms to recover it from the agent’s behavior or policy.
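
To make the definition above concrete, here is a minimal sketch of a world model as a computable approximation of P(s′|s,a). It is not the extraction algorithm from Google’s paper, which recovers the transition function from the agent’s policy; it simply counts observed transitions. The class name `TabularWorldModel` and the state and action strings are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


class TabularWorldModel:
    """Count-based estimate of the transition function P(s' | s, a).

    Illustrative only: states and actions are assumed to be hashable
    labels, and the estimate is the empirical frequency of each outcome.
    """

    def __init__(self):
        # counts[(s, a)][s_next] = how often s_next followed (s, a)
        self.counts = defaultdict(Counter)

    def observe(self, state, action, next_state):
        """Record one observed transition (s, a) -> s'."""
        self.counts[(state, action)][next_state] += 1

    def transition_probs(self, state, action):
        """Return the estimated distribution over next states for (s, a)."""
        seen = self.counts.get((state, action))
        if not seen:
            return {}  # no experience yet for this state-action pair
        total = sum(seen.values())
        return {s_next: n / total for s_next, n in seen.items()}

    def predict(self, state, action):
        """Return the most likely next state, or None if (s, a) is unseen."""
        probs = self.transition_probs(state, action)
        return max(probs, key=probs.get) if probs else None


# Hypothetical experience: (state, action, next_state) triples.
experience = [
    ("cart_empty", "add_item", "cart_full"),
    ("cart_empty", "add_item", "cart_full"),
    ("cart_empty", "add_item", "out_of_stock"),
    ("cart_full", "checkout", "order_placed"),
]

model = TabularWorldModel()
for s, a, s_next in experience:
    model.observe(s, a, s_next)

print(model.transition_probs("cart_empty", "add_item"))
print(model.predict("cart_empty", "add_item"))  # cart_full
```

With more experience the frequency estimates get closer to the true transition probabilities, which is the sense in which a better-performing agent needs a more precise model. A real agent would replace the lookup table with a learned function over much richer states.
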
When the goal is multi-step and broadly goal-directed, there is no ‘model-free shortcut’ that bypasses world modeling; better performance necessarily comes with more precise learning and representation of the transition function.

For example, ‘Xiao Mei’, an app released by Meituan in September, attempts to build an ‘action world model’ focused on local life services. It uses executable state-action-result representations to compress user preferences, merchants and inventory, time and location constraints, platform rules, and so on into a decision core that can be used for prediction, evaluation, and planning, and that is continuously corrected and executed through a dialogue loop.

Using this as an example, I will explore how the dedicated world model for Xiao Mei’s food ordering works:

1. State: Integrates user profiles and long-term preferences (taste, budget, allergies/restrictions, daily routine), context (current city, time, weather, location), merchants and inventory (open/closed, minimum order price, delivery fee, stock, remaining tickets), and historical interactions and feedback (positive/negative reviews, repurchases, tendency to modify notes). This account information is linked to the user’s information in the main Meituan app.
2. Action: A callable set of atomic capabilities and combination strategies, such as ‘retrieve candidates’, ‘re-rank by constraints’, ‘generate alternative lists’, ‘initiate order/reservation/navigation’, ‘change address/note/time’, ‘cancel/refund guidance’, etc.
3. Transition T(s,a)→s′: Probabilistic estimates of ‘when the order can be picked up or delivered’, ‘whether an address change can still be fulfilled’, ‘whether the promotion has expired’, ‘whether the item is sold out’, etc., which are used to update the next state.
4. Reward R: Trainable or configurable rewards for multi-objective trade-offs among delivery delay, fulfillment success rate, user satisfaction and repurchase, cost and discount utilization, compliance, and risk control.
5. Planning Horizon: From ‘immediate single-step’ actions (changing an address, checking remaining tickets) to ‘multi-step tasks’ (‘weekly breakfast arrangements’, ‘weekend family dinner + parking’), breaking long tasks into verifiable sub-goals and checkpoints through dialogue.
6. Tool Usage: Directly calls native capabilities for food delivery, dining, travel, hotel booking, etc., through built-in interfaces, so that execution stays in a closed loop inside the dialogue, reducing handoffs and information loss.
7. Context Boundaries: Currently strongest in local life contexts; still limited in scenarios such as dialect recognition and complex cross-app flows, where it needs user guidance or falls back to manual confirmation.

Once this loop completes, it can of course learn and remember, consolidating repurchase patterns, time preferences, and note habits for default recommendations and auto-fill the next time.

Relatively speaking, this is a very simple agent with clear expected outcomes. Future multifunctional agents will need to complete complex long-horizon tasks that require a series of interrelated decisions to reach their goals in dynamic environments. This demands that they know not only ‘what to do’ but also ‘why to do it’ and ‘what will happen by doing it’.

Traditional AI training is a two-stage process of ‘pre-training + fine-tuning’.
However, for agents that need to interact deeply with the world, we may require a three-stage process of ‘pre-training + mid-training + post-training’:

* Pre-training phase: Learning language and knowledge, mastering basic capabilities
* Mid-training phase: Understanding the operational rules of the world, establishing causal models
* Post-training phase: Optimizing strategies and goals in specific environments

This three-stage training paradigm may be the necessary path to achieving truly general agents. To train agents well, we cannot rush; we must give them time and opportunities to first understand the world and then change it. Let agents grow from passive imitators to active explorers, ultimately transforming into intelligent decision-makers.

The above information is partially sourced from:
Stop competing for rankings! The next battlefield for AI Agents is ‘mid-training’ | Interpretation of Meta’s latest paper
Link to Meta’s paper: [2510.08558] Agent Learning via Early Experience
