Do We Need Multi-Agents?

This article compares and analyzes the viewpoints and methods of “Don’t Build Multi-Agents” and “How We Built Our Multi-Agent Research System,” two pieces that stake out opposing positions on the question of whether to build multi-agent systems.

1. Main Points and Arguments of “Don’t Build Multi-Agents”

Main Argument: In “Don’t Build Multi-Agents,” Walden Yan argues that current frameworks for building agents on top of large language models (LLMs) are unsatisfactory. Drawing on practical experience, he distills a set of principles for building AI agents and warns that certain superficially attractive ideas (especially multi-agent architectures) are fragile in practice and rarely yield good results. His core claim is that, to ensure reliability, it is best to avoid having multiple agents collaborate on a task and to focus instead on improving the context management and decision consistency of a single agent.

Core Principles: Walden proposes two principles of “context engineering”: (1) share context: every sub-task’s actions should be taken against the complete context rather than an isolated slice of it; (2) actions carry implicit decisions, and conflicting implicit decisions lead to bad outcomes. He treats these two principles as foundational and argues that architectures violating them (such as splitting tasks among sub-agents that lack shared context) should be ruled out by default: any design that lets sub-agents make disjointed decisions will significantly reduce system reliability. To illustrate what goes wrong when the principles are violated, Walden constructs a typical scenario: in a complex task (such as game development), if the work is divided among sub-agents, each may produce incompatible parts due to misunderstandings, leaving the main agent unable to integrate them. He points out that even if every sub-agent is given the original task description, the details of multi-turn dialogue and tool calls in real systems can still produce misunderstandings, so such discrepancies cannot be fully eliminated.
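To make the two principles concrete, here is a minimal Python sketch (all names hypothetical, not from Walden’s article) contrasting the two designs: a single agent whose every action is decided against the full trace, versus sub-agents that each see only their own subtask.

```python
def call_llm(messages: list[dict]) -> str:
    """Placeholder for a real chat-completion call (provider-agnostic)."""
    raise NotImplementedError

def single_agent(task: str, tools) -> str:
    # Principle 1: every action is decided against the full trace, so later
    # decisions stay consistent with earlier implicit ones.
    trace = [{"role": "user", "content": task}]
    while True:
        action = call_llm(trace)                 # sees everything so far
        trace.append({"role": "assistant", "content": action})
        if action.startswith("DONE"):
            return action
        result = tools.run(action)               # tool output re-enters context
        trace.append({"role": "tool", "content": result})

def naive_multi_agent(task: str, subtasks: list[str]) -> str:
    # Anti-pattern (Principle 2): each sub-agent sees only its own subtask,
    # so implicit decisions (style, naming, interfaces) silently diverge.
    partials = [call_llm([{"role": "user", "content": s}]) for s in subtasks]
    merge_prompt = (f"Combine these into one answer for {task!r}:\n"
                    + "\n---\n".join(partials))
    return call_llm([{"role": "user", "content": merge_prompt}])
```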

Arguments and Examples: To support his viewpoint, Walden provides several examples and lessons learned to demonstrate the problems with multi-agent architectures:

  • Misunderstanding in the Flappy Bird Task: In a hypothetical example of building a “Flappy Bird” game, the main agent splits the task into two sub-tasks for sub-agents to complete. One sub-agent designs the background in the wrong style, while the other generates a character that does not meet expectations, ultimately leaving the main agent with two stylistically inconsistent pieces that are difficult to combine into a complete work. The example illustrates how parallel sub-agents acting independently can produce irreconcilable results.

  • Simplified Design of Claude Code: In practice, Anthropic’s Claude Code programming assistant reflects the industry’s caution: although Claude Code can spawn sub-agents for sub-tasks, it never runs them in parallel, and sub-agents are only used to answer specific investigative questions, never to write code directly. The reason is that a sub-agent cannot see the main agent’s full programming context, which makes complex creative work risky to delegate; running several sub-agents in parallel could also yield conflicting answers, recreating the reliability problems described above. The designers of Claude Code therefore deliberately keep the architecture simple and linear to avoid the inconsistency risks of multiple agents (a sketch of this design appears after this list).

  • Experience with the “Devin” Code Editing Model: In 2024, many code-generation tools (IDE plugins, app generators, etc.) tried a two-step “large model + small model” scheme (for example, in the Devin system): a large model generates modification instructions, which are passed to a small model that edits the code accordingly. This “edit-apply” mode was originally designed to work around the unreliability of large models emitting code patches directly. In practice, however, this kind of multi-model collaboration proved error-prone: the small model often misread the large model’s intent because of subtle ambiguities in the instructions, producing incorrect edits. As more capable models emerged, developers found that having a single model perform the change directly is usually more reliable, and such editing tasks are now more often completed by one model in a single interaction (the edit-apply pattern is sketched after this list). The Devin case highlights that once a single agent is strong enough, multi-agent division of labor may be superfluous.
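As a rough illustration of the Claude Code restraint described above (hypothetical code sketching the described design, not Claude Code’s implementation): a sub-agent is spawned only to answer a bounded question, and only its textual answer flows back into the main agent’s single context.

```python
def ask_subagent(question: str, sub_llm) -> str:
    """A sub-agent is spawned for one bounded, investigative question.
    It starts from a fresh context (it cannot see the main trace) and
    returns text only; it never writes code into the project."""
    return sub_llm(f"Answer concisely: {question}")

def main_agent_step(trace: list[dict], sub_llm) -> None:
    # The main agent remains the single writer; sub-agent answers are folded
    # back into its one continuous context, sequentially, never in parallel.
    answer = ask_subagent("Where is the retry logic configured?", sub_llm)
    trace.append({"role": "tool", "content": f"subagent: {answer}"})
```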
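The edit-apply pattern and its single-model replacement can be sketched as follows (hypothetical names and prompts; `big_llm` and `small_llm` stand in for any two models of unequal capability):

```python
def edit_apply(source: str, request: str, big_llm, small_llm) -> str:
    """The fragile two-model pipeline: the instruction string is the lossy
    handoff where the small model's misinterpretations creep in."""
    # Step 1: the large model describes the change in natural language.
    instructions = big_llm(
        f"Describe the minimal edit needed to satisfy:\n{request}\n\n"
        f"Code:\n{source}")
    # Step 2: a cheaper model rewrites the file from those instructions.
    return small_llm(
        f"Apply these edit instructions and return the full updated file.\n\n"
        f"Instructions:\n{instructions}\n\nCode:\n{source}")

def single_model_edit(source: str, request: str, big_llm) -> str:
    # The replacement: one capable model performs the edit directly in a
    # single interaction, with no lossy handoff between models.
    return big_llm(f"Rewrite the code to satisfy:\n{request}\n\nCode:\n{source}")
```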

Solution Approach: Walden recommends a single-threaded, single-agent system to preserve continuity in task execution, with an auxiliary model that compresses long dialogue history to work around context-window limits. Until the challenge of transferring context across agents is solved, he regards multi-agent systems as inherently fragile, and he predicts that as single-agent capabilities improve the problem will gradually resolve itself, eventually unlocking genuine multi-agent collaboration. At this stage, therefore, he advocates perfecting single-agent systems before transitioning to multi-agent modes.
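Walden describes the idea rather than an implementation, but it might be sketched like this (hypothetical code; the token counter and thresholds are assumptions): when the trace nears the context limit, an auxiliary model compresses older turns into a summary of the key events and decisions, which then replaces them.

```python
MAX_TOKENS = 100_000     # assumed context budget, illustrative only
KEEP_RECENT = 10         # recent turns kept verbatim

def count_tokens(messages: list[dict]) -> int:
    """Crude stand-in; use the provider's tokenizer in practice."""
    return sum(len(m["content"]) for m in messages) // 4

def compress_if_needed(trace: list[dict], summarizer_llm) -> list[dict]:
    """The single-threaded agent keeps running; an auxiliary model condenses
    older history so key decisions survive past the context window."""
    if count_tokens(trace) < MAX_TOKENS:
        return trace
    head, tail = trace[:-KEEP_RECENT], trace[-KEEP_RECENT:]
    summary = summarizer_llm(
        "Compress this agent history into the key events, decisions, and "
        "constraints that later steps must respect:\n"
        + "\n".join(m["content"] for m in head))
    return [{"role": "system",
             "content": f"Summary of earlier work: {summary}"}] + tail
```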

2. Main Points and Arguments of “How We Built Our Multi-Agent Research System”

Main Argument: In “How We Built Our Multi-Agent Research System,” the Anthropic team shares its practical experience building a multi-agent research system. Unlike Walden’s conservative stance, the Anthropic team argues that multi-agent architectures can significantly boost performance on certain classes of complex tasks, despite higher resource consumption and implementation complexity. Using open-ended information-retrieval and research tasks as the application scenario, they show that a well-designed multi-agent system can surpass a single agent and deliver real value within a reasonable cost-benefit balance.

Evidence of Performance Improvement: Anthropic supports its argument with internal experimental data showing that, for complex queries requiring browsing and gathering large amounts of information, multi-agent systems far outperform single-agent systems. In one internal evaluation, they asked Claude to answer a broad question: “Identify all board members of companies in the S&P 500 Information Technology sector.” The multi-agent system, using Claude Opus 4 as the lead agent and multiple Claude Sonnet 4 sub-agents, found the correct answer by decomposing the task and having sub-agents search in parallel, whereas a single Claude Opus 4 agent, limited to slow sequential searches, failed to find the complete answer. On their overall internal evaluation, the multi-agent system outperformed the single agent on such tasks by approximately 90.2%.

The Anthropic team attributes the multi-agent system’s effectiveness primarily to its ability to allocate enough tokens to explore the problem space. On the BrowseComp benchmark (which evaluates agents’ ability to browse the web for hard-to-find information), three factors explain about 95% of the performance variance: token consumption is the largest (accounting for roughly 80% of the variance), followed by the number of tool calls and the choice of underlying model. In other words, the multi-agent architecture expands the system’s total “thinking” capacity by letting multiple sub-agents reason in parallel, each with its own context window; multi-agents let the model spend more tokens and more steps, breaking through the limits of what a single context can hold at one time. Anthropic views this as a scalability advantage when tasks exceed a single agent’s processing limits: multiple agents effectively enlarge the usable context and the depth of parallel reasoning, enabling complex problems a single agent cannot handle.
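The orchestrator-worker pattern described here might look roughly like the following sketch (hypothetical code, not Anthropic’s; `lead_llm` and `sub_llm` stand in for an Opus-class and a Sonnet-class model): the lead agent plans, sub-agents search in parallel in their own context windows, and the lead agent synthesizes.

```python
import concurrent.futures

def research(query: str, lead_llm, sub_llm, max_subagents: int = 4) -> str:
    # The lead agent decomposes the query into independent search subtasks.
    plan = lead_llm(f"Split this question into at most {max_subagents} "
                    f"independent search subtasks, one per line:\n{query}")
    subtasks = [s.strip() for s in plan.splitlines() if s.strip()][:max_subagents]

    # Each sub-agent runs in its own context window, in parallel; this is
    # what multiplies the total tokens the system can spend on the problem.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        findings = list(pool.map(
            lambda s: sub_llm(f"Research and report findings for: {s}"),
            subtasks))

    # The lead agent synthesizes the parallel findings into one answer.
    return lead_llm(f"Question: {query}\n\nSub-agent findings:\n"
                    + "\n---\n".join(findings))
```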

Costs and Limitations: Although multi-agents perform excellently, Anthropic candidly discusses their costs and limitations.

  • Resource Consumption: Multi-agent architectures consume far more tokens and computation in practice. Anthropic’s data shows that a single agent completing a task typically consumes about four times the tokens of an ordinary chat interaction, while a multi-agent collaboration can exceed fifteen times the tokens of a normal chat (a back-of-the-envelope cost sketch follows below). Such overhead makes economic feasibility a major constraint: only scenarios where the task is valuable enough and result quality matters enough justify the computation that multi-agent systems require.

  • Task Type Limitations: Not all tasks suit parallel processing by multiple agents. When the parts of a task are tightly coupled or depend on a unified global state, coordinating multiple agents is hard with current technology. Anthropic points out that code writing and modification is one such task: programming sub-tasks usually have sequential dependencies, making true parallelism difficult, and current LLM agents are not yet good at real-time coordination and delegation of work.

In contrast, they find that the multi-agent architecture excels where tasks can be cleanly divided into parallel sub-tasks, where the information to be processed exceeds a single context window, and where many complex tools must be invoked. Under those conditions its advantages are fully realized; conversely, forcing multi-agents onto tasks whose information and steps are tightly interrelated yields little benefit and may introduce unnecessary complexity and errors.
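Returning to the resource-consumption point above, the article’s multipliers translate into costs roughly as follows (the 4x and 15x factors come from Anthropic; the baseline token count and price are purely assumed for illustration):

```python
# Back-of-the-envelope costs using the article's 4x / 15x multipliers.
CHAT_TOKENS = 5_000        # assumed tokens for a typical chat interaction
PRICE_PER_MTOK = 10.0      # assumed blended price in $ per million tokens

for label, multiplier in [("chat", 1), ("single agent", 4), ("multi-agent", 15)]:
    tokens = CHAT_TOKENS * multiplier
    cost = tokens / 1_000_000 * PRICE_PER_MTOK
    print(f"{label:>12}: ~{tokens:,} tokens, about ${cost:.2f}")
```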

Anthropic emphasizes that deploying multi-agent systems in production is highly challenging, as small errors can be magnified. To ensure reliability, they adopt a strategy of “rigorous engineering + full-stack collaboration,” which includes strict testing, carefully designed interfaces, and robust monitoring and backup mechanisms. This well-engineered system has proven successful in creating value for users, helping them solve complex problems and saving significant time, demonstrating its potential to surpass single-agent systems.

To optimize multi-agent systems, Anthropic has introduced advanced strategies. For long dialogues, the system manages context by having agents generate periodic summaries and utilize external storage to avoid information overload. To reduce distortion of information during transmission, the system allows sub-agents to write their output results (such as code) directly into shared files. These designs cleverly address the core issues of context sharing and information loss, making multi-agent systems more efficient and reliable.
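Taken together, these mitigations might be sketched as follows (hypothetical code illustrating the described techniques, not Anthropic’s implementation): sub-agents persist full artifacts to shared storage and pass back only lightweight references, so results are not distorted by being paraphrased through the coordinator.

```python
import json
import pathlib

WORKSPACE = pathlib.Path("shared_workspace")    # assumed shared store

def subagent_finish(agent_id: str, artifact: str, summary: str) -> dict:
    """Write the full output (e.g., generated code) to shared storage and
    return only a short summary plus a reference to it."""
    WORKSPACE.mkdir(exist_ok=True)
    path = WORKSPACE / f"{agent_id}.txt"
    path.write_text(artifact)
    return {"agent": agent_id, "summary": summary, "artifact_ref": str(path)}

def lead_agent_view(results: list[dict]) -> str:
    # The lead agent reasons over compact summaries; full artifacts are
    # loaded from disk only when actually needed.
    return json.dumps(results, indent=2)
```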

3. Consensus and Differences Between the Two Articles

3.1 Consensus and Common Progress

Despite clear differences in conclusions, a careful analysis reveals that both articles share many fundamental understandings:

  • Reliability and Error Control: Both parties recognize that AI agents have low fault tolerance, thus viewing “controlling error accumulation” to enhance system stability as a core objective.

  • Context Continuity and Sharing: Both agree that maintaining continuity and sharing of context is a key principle in building reliable agents. Anthropic has implemented the “context engineering” advocated by Walden through memory mechanisms and other means.

  • Avoiding Decision Conflicts: To prevent decision conflicts among multiple agents, Walden tends to avoid parallel decision-making, while Anthropic has designed a clear hierarchy, with the main agent making unified decisions to fundamentally reduce conflicts.

  • Solutions for Long Context Tasks: To handle long contexts, both adopt “summarization/memory” mechanisms that intelligently compress historical information to extend the model’s effective memory, reflecting the broader technical shift from “prompt engineering” to “context engineering”.

Consensus on the Applicability of Multi-Agents: Interestingly, the two articles nearly agree on where multi-agents are not suitable. Through the Claude Code example, Walden emphasizes that in tasks requiring a unified style or a large amount of shared state (such as programming), parallel sub-agents cause confusion for lack of shared context, which is why Claude Code deliberately avoids them (note: Claude Code appears to have recently added a parallel sub-agent mode, which the author has not yet tried). Anthropic likewise states plainly that most coding tasks are currently unsuited to multi-agents: they are hard to decompose into truly independent sub-tasks and require agents to share large amounts of context and coordinate in real time, which exceeds what multi-agents can do today. Both agree that when a task is inherently non-parallel or depends heavily on a unified context, a single-agent linear process should be used. In other words, for code generation and modification, current industry practice favors single-agent serial execution (with a sufficiently long context window or step-by-step prompting); this is a consensus the two articles share.

3.2 Controversies and Differences

Despite the aforementioned consensus, the differences in viewpoints and methods between the two articles remain quite pronounced, mainly reflected in the following aspects:

  • Radically Different Attitudes Toward Multi-Agent Architectures: Walden Yan takes a negative, cautious stance on multi-agents. He states bluntly that having multiple agents collaborate in 2025 “will only produce fragile systems,” because decision-making becomes decentralized and context cannot be shared sufficiently; his recommendation is to “default to avoiding” any architecture that violates the principles of context continuity and decision consistency. The Anthropic team, by contrast, is positive and exploratory: they have not only built a multi-agent system but also published experimental data demonstrating its clear advantages on specific tasks, and at the end of their article they describe how the system has already helped users solve real problems. The clear divergence, then, is that Walden emphasizes the current risks and drawbacks of multi-agents, while Anthropic highlights their benefits and prospects in certain scenarios. Note that Anthropic is not blindly optimistic: they too enumerate the limitations and applicability conditions of multi-agents and do not claim they can replace single agents everywhere.

  • Differences in Architectural Design Trade-offs: Walden advocates a simplified architecture: one powerful agent completes the task from start to finish, supplemented by context-enhancement techniques, rather than introducing a second agent in parallel. This reflects his “single agent + context compression” model. Anthropic instead opts for a multi-agent, centrally coordinated architecture, with multiple sub-agents working in parallel under a lead agent. Although Anthropic also uses summarization and memory to manage context, they fundamentally expand capacity by adding agents, whereas Walden expands capacity by strengthening a single agent. Each approach has its trade-offs: the former is architecturally simple with strong consistency but is bounded by single-model performance; the latter is parallel and efficient but raises coordination costs and implementation complexity.

  • Attitudes Toward Complexity: Walden and Anthropic make different trade-offs in engineering philosophy. Walden holds that simplicity is robustness: he would rather forgo some functionality than introduce complex architectures that can produce uncontrollable errors, prioritizing determinism and understandability even at the cost of the efficiency parallelism could bring. Anthropic, by contrast, embraces complexity for performance: they invest heavy engineering resources to manage the complexity of multi-agents, so long as the final system’s performance and user value improve significantly.

4. Current Situation Analysis

Looking at the most successful AI applications in the industry today (such as OpenAI’s ChatGPT/GPT-4, Anthropic’s Claude, and Cognition’s code agent Devin), we can see more clearly which approach better matches practical realities, and what the consensus and differences above imply for the industry’s development.

Mainstream Product Paradigm: Single Agent Dominates. As of 2025, most mature AI products still use a single large model as the agent that interacts with users, extending its capabilities through plugins and tools rather than through collaborating agents. Take OpenAI’s ChatGPT and its underlying GPT-4 model: one large pretrained model handles the dialogue and, when needed, invokes external tools (such as database queries or code execution) from within the same prompt-driven loop, preserving a single agent’s continuous chain of thought throughout. This architectural choice matches the principles Walden Yan advocates: center on one powerful agent while keeping the context consistent and the decision chain simple. ChatGPT’s rapid large-scale deployment owes much to the single-agent paradigm reducing uncertainty; the model is easier to tune through training and reinforcement learning from human feedback (RLHF), making its behavior predictable and consistent. In practical engineering, every additional agent means additional uncertainty, which conflicts with an internet product’s demand for stability. Walden’s approach is thus validated in today’s mainstream AI products: the tendency is to refine a single model toward perfection rather than combine multiple imperfect ones.

Current engineering practices lean towards robust single-agent solutions while viewing multi-agent systems as an exploration for the future. The future trend is a fusion of both, where single agents will internalize multi-agent capabilities, and multi-agent architectures will mature. The best strategy at this stage is to make choices based on specific application needs, balancing between mature single-agent paradigms and cutting-edge multi-agent collaboration.

5. My Perspective

From my own usage, the agent products that currently perform well, such as OpenAI DeepResearch (ODR) and Claude Code, are single-agent products whose capabilities come from strong base models or domain-specific fine-tuning (in ODR’s case). By contrast, the multi-agent Claude DeepResearch is fast but produces reports of noticeably lower quality. This may partly be because its multi-path queries deprive sub-tasks of shared context, in stark contrast to ODR, which sees the entire current context and decides each next action on that basis.

Large models are evolving so rapidly that carefully crafted frameworks and prompts are likely to be made obsolete by the next round of model updates. Using a single agent directly on a single task is therefore the most cost-effective choice at present: as model capabilities improve, results improve with almost no additional investment.

Large models are still in a phase of rapid development and their capabilities have not yet hit a bottleneck, so the single-agent model remains mainstream. But under the constraints of data and compute, single-model performance will eventually reach an upper limit, and that is when multi-agents may come into their own. Interestingly, a model vendor like Anthropic is actively seeking growth beyond raw model capability, while an application company like Cognition (maker of Devin) is betting firmly on continued model progress, suggesting that each company has felt the limits of the path it originally chose.
