1. Paper Information:
- Paper Title: Why Do Multi-Agent LLM Systems Fail?
- Authors: Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
- Link: https://arxiv.org/abs/2503.13657
2. Main Content of the Paper:
1. Abstract
- Despite growing attention on Multi-Agent Systems (MAS), in which multiple agents based on Large Language Models (LLMs) collaborate to complete tasks, their performance gains on popular benchmarks remain marginal compared to single-agent frameworks. This points to the need to analyze the challenges that limit MAS effectiveness. The paper presents the first comprehensive study of MAS challenges, analyzing five popular MAS frameworks across more than 150 tasks with six expert human annotators. The study identifies 14 unique failure modes and proposes a comprehensive classification system, the Multi-Agent System Failure Taxonomy (MASFT), applicable across MAS frameworks. The taxonomy was developed through consensus iterations among three expert annotators, reaching a Cohen's Kappa score of 0.88. The fine-grained failure modes fall into three categories: (i) specification and system design failures, (ii) agent misalignment, and (iii) task validation and termination. To support scalable evaluation, the study pairs MASFT with an LLM-as-a-Judge pipeline. The paper also tests whether the identified failures are easy to prevent by proposing two interventions: improved specification of agent roles and enhanced orchestration strategies. The results show that the identified failures require more complex solutions, providing a clear roadmap for future research. The dataset and LLM annotator are open-sourced.
2. Research Background and Core Content
- In recent years, LLM-based agent systems have gained widespread attention in the AI community for their ability to handle complex multi-step tasks while dynamically interacting with diverse environments. Multi-Agent Systems (MAS) have been increasingly explored in fields such as software engineering, drug discovery, and scientific simulation. However, despite the growing adoption of MAS, their accuracy and performance gains over single-agent frameworks remain limited. The core research question is: why do MAS fail? To answer it, the researchers applied the Grounded Theory method to systematically evaluate five popular open-source MAS, analyzing over 150 dialogue traces and identifying 14 unique failure modes, which they grouped into three main categories to form the first structured MAS failure taxonomy (MASFT). The study also developed a scalable LLM-as-a-Judge evaluation pipeline for analyzing new MAS performance and diagnosing failure modes.
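The LLM-as-a-Judge process described above can be sketched in code. This is a minimal illustration, not the paper's released pipeline: the category labels are abbreviated stand-ins for MASFT's three groups, the prompt wording is an assumption, and the judge model is passed in as a plain callable so any LLM client (or a test stub) can be plugged in.

```python
# Minimal sketch of an LLM-as-a-Judge failure annotator (hypothetical,
# not the paper's released implementation). The judge is any callable
# mapping a prompt string to the model's text reply.
from typing import Callable

# Abbreviated stand-ins for MASFT's three failure categories.
CATEGORIES = [
    "specification_and_system_design",
    "agent_misalignment",
    "task_validation_and_termination",
]

def build_prompt(trace: str) -> str:
    """Embed the taxonomy and a MAS execution trace into one judge prompt."""
    labels = ", ".join(CATEGORIES)
    return (
        "You are an expert annotator of multi-agent LLM system failures.\n"
        f"Valid labels: {labels}, or 'no_failure'.\n"
        "Reply with exactly one label.\n\n"
        f"Trace:\n{trace}\n"
    )

def annotate(trace: str, judge: Callable[[str], str]) -> str:
    """Ask the judge model for a label and validate its answer."""
    reply = judge(build_prompt(trace)).strip().lower()
    return reply if reply in CATEGORIES + ["no_failure"] else "unparsed"
```

In practice the callable would wrap an API client, and the prompt would carry the full MASFT definitions plus few-shot examples, as the paper's few-shot o1 variant suggests.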
3. Methodological Analysis and Main Contributions
- The study employed the Grounded Theory method, constructing theory directly from empirical data through theoretical sampling, open coding, constant comparative analysis, memo writing, and theory building, rather than testing predefined hypotheses. The researchers collected and analyzed MAS execution traces, iteratively identifying failure modes and validating and refining the classification system through consensus among three expert annotators. The main contributions include:
- Proposing the first empirically-based MAS failure classification system (MASFT), providing a structured framework for understanding and mitigating MAS failures.
- Developing a scalable LLM-as-a-Judge evaluation process for analyzing new MAS performance and diagnosing failure modes.
- Through case studies, exploring interventions on agent specifications, dialogue management, and validation strategies; despite yielding a 14% improvement in task completion, these interventions failed to fully resolve MAS failures, highlighting the need for structural redesign of MAS.
- Open-sourcing all 150+ annotated MAS dialogue traces, the scalable LLM-as-a-Judge evaluation process, and detailed expert annotations for 15 selected traces.
4. Practical Work
- The researchers first defined the concepts of LLM-based agents and Multi-Agent Systems, then selected five popular open-source MAS for analysis: MetaGPT, ChatDev, HyperAgent, AppWorld, and AG2. They collected execution traces from these systems and, with six expert human annotators, annotated over 150 dialogue traces. During the theoretical sampling phase, tasks were selected to maximize system diversity, covering different objectives, organizational structures, implementation methods, and agent roles. In the open coding phase, annotators identified failure modes and iteratively refined their definitions through constant comparative analysis. In the inter-annotator agreement study, the researchers adjusted the failure-mode definitions and the taxonomy through multiple rounds of discussion and annotation until reaching stable consensus, with a Cohen's Kappa score of 0.88. Finally, they developed an LLM-based annotator for automated failure identification and validated its reliability and effectiveness.
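The Cohen's Kappa score used in the agreement study above can be computed directly from two annotators' per-trace labels. A minimal stdlib-only sketch (the label values in the test are illustrative, not the study's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance, computed from each
    annotator's marginal label frequencies.
    """
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

A score of 0.88, as reported, is well above the 0.8 threshold conventionally read as near-perfect agreement.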
5. Experimental Data and Research Findings
- The researchers analyzed five popular MAS frameworks and found high failure rates; ChatDev's accuracy, for example, was only 25%. Using the Grounded Theory method, they identified 14 unique failure modes and grouped them into three categories: specification and system design failures (37.2%), agent misalignment (31.4%), and task validation and termination (31.4%). These failure modes are distributed differently across systems, indicating that each has its own strengths and weaknesses; for instance, AG2 has fewer issues with agent misalignment, while ChatDev struggles more with specifications and agent misalignment. The researchers also developed an LLM-as-a-Judge pipeline that achieved 94% accuracy and a Cohen's Kappa of 0.77, demonstrating its reliability. In case studies, they applied interventions to AG2 and ChatDev, including improved prompts and optimized agent topologies. For AG2, improved prompts raised accuracy from 84.75% to 89.75% with GPT-4 and from 84.25% to 89.00% with GPT-4o. For ChatDev, improved role prompts and topology raised accuracy from 25.0% to 40.6% on the ProgramDev task and from 89.6% to 91.5% on HumanEval. However, these gains remain insufficient for real-world deployment, indicating the need for more complex solutions.
3. Figures and Tables:
Figure 1: Failure Rates of Five Popular Multi-Agent LLM Systems
- Source: Chapter 1 “Introduction”
- Content Description: Figure 1 visually presents the failure rates of five popular Multi-Agent LLM systems (AG2, AppWorld, ChatDev, HyperAgent, MetaGPT) when using GPT-4o and Claude-3, through a bar chart illustrating the success and failure ratios of each system.
- Key Findings: The differences in failure rates across systems indicate that MAS performance depends not only on the underlying LLM but also on system design and architecture. ChatDev exhibits a relatively high failure rate, suggesting substantial room for optimization in its design.
Figure 2: MAS Failure Mode Classification System
- Source: Chapter 4 “Study Findings”
- Content Description: Figure 2 presents the MASFT (Multi-Agent System Failure Taxonomy) classification system, categorizing 14 failure modes into three main groups: specification and system design failures, agent misalignment, and task validation and termination, and showing the distribution of each failure mode across the stages of MAS execution.
- Key Findings: Failure modes are distributed relatively evenly across the three categories, indicating that MAS failures are multifaceted. Addressing them requires attention to system design, agent collaboration, and task validation across multiple execution stages, underscoring the importance of monitoring and management throughout the process.
Figure 3: System Research Methodology Workflow
- Source: Chapter 3 “Study Methodology”
- Content Description: Figure 3 describes the methodological workflow for studying MAS failure modes, including steps for failure identification, classification development, and inter-annotator agreement studies, showcasing the entire iterative process from data collection to the finalization of the classification system.
- Key Findings: This workflow ensures the reliability and validity of the taxonomy through consensus iterations among expert annotators, providing a solid methodological foundation for the subsequent failure-mode analysis and solution design. The inter-annotator agreement study is the key step for verifying that the taxonomy is unambiguous, ensuring the credibility of the research results.
Figure 4: Distribution of Failure Modes in Different Systems
- Source: Chapter 4 “Study Findings”
- Content Description: Figure 4 presents the distribution of various failure modes across different MAS systems in a bar chart format, with each system’s bar chart color-coded by failure category and further subdivided into different failure modes.
- Key Findings: The distribution of failure modes differs significantly across systems. For example, AG2 encounters fewer specification- and validation-related failures but more agent misalignment problems, reflecting differences in architecture and task allocation, while ChatDev faces especially prominent challenges in specifications and agent misalignment. These differences indicate the need for optimization strategies targeted at each system.
Figure 5: Dialogue Example Between Phone Agent and Supervisor Agent
- Source: Chapter 4 “Study Findings”
- Content Description: Figure 5 presents a specific dialogue example where the Phone Agent failed to communicate the API specifications and login username requirements to the Supervisor Agent, leading to a dialogue failure, with the dialogue content presented in a code block format clearly showing the breakdown in information transfer.
- Key Findings: Poor information transfer is a common failure mode in MAS, especially in tasks where multiple agents must share critical information. Without effective communication mechanisms between agents, such omissions can cause task failure, underscoring the importance of designing standardized communication protocols.
Figure 6: Correlation Matrix of MAS Failure Categories
- Source: Chapter 4 “Study Findings”
- Content Description: Figure 6 presents the correlation matrix of MAS failure categories, using a heatmap to show the correlations between different failure categories, with values ranging from 0 to 1 indicating the strength of correlation.
- Key Findings: Correlations between failure categories are not particularly strong, indicating that MAS failures typically result from multiple factors rather than a single cause. This diversity means solution design must address multiple aspects at once rather than optimize for just one type of failure.
Figure 7: Correlation Matrix of MAS Failure Modes
- Source: Chapter 4 “Study Findings”
- Content Description: Figure 7 presents the correlation matrix of MAS failure modes, detailing the correlations between the 14 failure modes, visually represented through values and heatmaps, helping to understand the potential connections between different failure modes.
- Key Findings: Some failure modes exhibit high correlations, suggesting they stem from similar system design issues or agent interaction patterns. This provides a basis for identifying and preventing likely combinations of failures, supporting more comprehensive risk assessment and optimization during the system design phase.
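The correlation matrices in Figures 6 and 7 can be reproduced from per-trace binary failure indicators. A minimal sketch using NumPy; the indicator matrix `X` below is made up purely for illustration, not the study's data:

```python
import numpy as np

def failure_correlations(indicators: np.ndarray) -> np.ndarray:
    """Pearson correlation between failure modes.

    indicators: (n_traces, n_modes) 0/1 matrix, where entry [i, j] = 1
    iff failure mode j was observed in trace i.
    """
    # np.corrcoef treats rows as variables, so transpose: one row per mode.
    return np.corrcoef(indicators.T)

# Illustrative (made-up) indicators: 4 traces x 3 failure modes.
X = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
])
C = failure_correlations(X)  # 3x3 symmetric matrix, diagonal = 1
```

With the taxonomy's 14 modes, the same call on a (150+, 14) indicator matrix yields the 14x14 heatmap of Figure 7; aggregating columns by category yields Figure 6.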
Table 1: List of MAS Systems Analyzed in the Study
- Source: Chapter 3 “Data Collection and Analysis”
- Content Description: Table 1 lists the five MAS systems analyzed in the study, including MetaGPT, ChatDev, HyperAgent, AppWorld, and AG2, along with the architecture and objectives of each system.
- Key Findings: These systems cover a variety of fields and application scenarios, providing a rich data foundation for subsequent failure mode analysis.
Table 2: Performance of LLM-as-a-Judge Pipeline
- Source: Chapter 3 “LLM Annotator”
- Content Description: Table 2 presents the performance metrics of the LLM-as-a-Judge pipeline, including accuracy, recall, precision, F1 score, and Cohen’s κ value, for both the o1 model and the o1 (few shot) model.
- Key Findings: The accuracy of the o1 (few shot) model reached 94%, with a Cohen’s κ value of 0.77, indicating high reliability and effectiveness in failure identification and classification.
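The metrics reported in Table 2 can be recomputed from paired judge/expert labels. A stdlib sketch for the binary case (failure vs. no failure), using the standard confusion-matrix definitions of accuracy, precision, recall, and F1:

```python
def binary_metrics(pred, gold):
    """Accuracy, precision, recall, F1 from paired binary labels (1 = failure)."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gold))
    acc = (tp + tn) / len(gold)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```

Cohen's κ would be computed the same way as in the inter-annotator study, treating the LLM judge as one annotator and the expert labels as the other.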
Table 3: Mapping of Solution Strategies to Failure Categories
- Source: Chapter 5 “Towards better Multi-Agent LLM Systems”
- Content Description: Table 3 lists the proposed solution strategies for different failure categories, including tactical approaches and structural strategies, providing specific guidance for improving MAS design.
- Key Findings: Different failure categories require different solutions, with structural strategies generally being more effective than tactical approaches, though they are also more challenging to implement.
Table 4: Accuracy Comparison of Case Studies
- Source: Chapter 6 “Case Studies”
- Content Description: Table 4 presents the task completion accuracy under different configurations in two case studies, including baseline implementations, improved prompts, and new topological structures, specifically for AG2 and ChatDev systems.
- Key Findings: Improved prompts and new topological structures can significantly enhance performance in certain cases, but the effectiveness varies by system and task, indicating a need for deeper structural improvements to comprehensively enhance the reliability of MAS.
4. References:
1. Research Related to Multi-Agent Systems (MAS):
- Representative Literature:
- “Multi-Agent Systems studied with human-annotated traces”: This literature provides a detailed introduction to multiple human-annotated multi-agent systems, offering a rich case foundation for this research, aiding in the identification and analysis of failure modes in MAS.
- “Benchmark: Benchmarking multi-agent reinforcement learning”: This paper proposes a benchmarking platform for multi-agent reinforcement learning, which is of significant reference value for evaluating the performance and reliability of MAS.
- “Harnessing language for coordination: A framework and benchmark for LLM-driven multi-agent control”: This paper explores how to utilize language for multi-agent coordination, providing a framework and benchmark that offers new insights for the design and optimization of MAS.
- Core Summary:
- This type of literature mainly focuses on the architecture, performance evaluation, and practical applications of multi-agent systems, providing empirical data and theoretical support for improving system design and enhancing reliability through the analysis of different MAS performances across various tasks.
2. LLM Agent System Design and Optimization:
- Representative Literature:
- “Specifications: The missing link to making the development of LLM systems an engineering discipline”: This paper emphasizes the importance of specifications in LLM system development, pointing out that clear specifications are key to transforming LLM system development into an engineering discipline.
- “Building effective agents”: A blog post by Anthropic sharing experiences and strategies for building efficient agents, such as modular components, prompt chains, and routing.
- “Are more LLM calls all you need? Towards scaling laws of compound inference systems”: This paper explores the feasibility of improving the performance of compound inference systems by increasing the number of LLM calls, providing quantitative analysis for optimizing agent systems.
- Core Summary:
- This literature focuses on the development and optimization of LLM agent systems, discussing how to enhance agent performance and reliability through specification design, modular components, and reasonable system architecture from both engineering practice and theoretical analysis perspectives.
3. Research on Failure Modes and Reliability of Agent Systems:
- Representative Literature:
- “Challenges in human-agent communication”: This paper provides an in-depth analysis of the challenges in communication between humans and agents, offering important references for understanding failure modes in MAS.
- “Failures Taxonomization in LLM Systems”: Similar to this research, this paper aims to classify failures in LLM systems, providing methodological support for identifying and addressing MAS failures.
- “Agent workflow memory”: This paper introduces the concept of agent workflow memory, exploring how memory management can enhance agent performance in long-sequence tasks, providing insights for addressing issues like context loss in MAS.
- Core Summary:
- This type of literature focuses on failure identification, classification, and reliability enhancement in agent systems, proposing various strategies and methods for improving system reliability through the analysis of different types of failures and their causes.
4. High Reliability Organization (HRO) Theory and Practice:
- Representative Literature:
- “Normal Accidents: Living with High-Risk Technologies”: This book discusses normal accidents in high-risk technology systems, emphasizing the impact of organizational structure on system reliability.
- “New challenges in organizational research: High reliability organizations”: This paper analyzes the new challenges faced by high reliability organizations, providing organizational-level insights for the design and management of MAS.
- “Reliable organizations: Present research and future directions”: This paper reviews the current state of research on reliable organizations and prospects for future research directions, offering organizational management perspectives for enhancing MAS reliability.
- Core Summary:
- This literature explores how to enhance the reliability of complex systems through reasonable organizational structures and management strategies from the perspective of organizational theory, providing important theoretical references for the design and optimization of MAS.
5. LLM Technology and Applications:
- Representative Literature:
- “Training verifiers to solve math word problems”: This paper studies how to train verifiers to solve math word problems, providing technical guidance for the application of LLMs in specific tasks.
- “Gorilla: Large language model connected with massive APIs”: This paper introduces the Gorilla system, demonstrating how to connect large language models with numerous APIs, expanding the application scenarios of LLMs.
- “MemGPT: Towards LLMs as operating systems”: This paper proposes the MemGPT framework, aiming to evolve LLMs into operating systems, providing new directions for the technological evolution and application expansion of LLMs.
- Core Summary:
- This literature mainly introduces the latest advancements in LLM technology and its applications in various fields, providing technical support and innovative ideas for MAS research and development from both technical implementation and application expansion perspectives.