Why do multi-agent Large Language Model (LLM) systems fail? The University of California, Berkeley recently published a paper titled "Why Do Multi-Agent LLM Systems Fail?" that examines why multi-agent systems (MAS) break down, identifying 14 specific failure modes and offering corresponding improvement suggestions.
Below is a translation of the paper. Enjoy.
Introduction
Despite the growing enthusiasm for Multi-Agent Systems (MAS), where multiple LLM agents collaborate to complete tasks, their performance improvements on popular benchmarks remain minimal compared to single-agent frameworks. This gap highlights the necessity of analyzing the challenges that hinder the effectiveness of MAS.
In this paper, we conduct a comprehensive study of the challenges faced by MAS for the first time. We analyze five popular MAS frameworks, covering over 150 tasks, involving six expert annotators. We identify 14 unique failure modes and propose a comprehensive taxonomy applicable to various MAS frameworks. This taxonomy emerged iteratively from the consensus among three expert annotators in each study, achieving a Cohen’s Kappa score of 0.88. These granular failure modes are categorized into three classes: (i) specification and system design failures, (ii) misalignment among agents, and (iii) task validation and termination. To support scalable evaluation, we integrate MASFT, our MAS failure taxonomy, with LLM-as-a-Judge.
“Successful systems are all alike; every failed system fails in its own way.” (Berkeley, 2025)
Recently, agent systems based on large language models (LLM) have garnered widespread attention in the AI community. This growing interest stems from the ability of agent systems to handle complex multi-step tasks while dynamically interacting with different environments, making LLM-based agent systems particularly suitable for solving real-world problems. Based on this characteristic, multi-agent systems have seen increasing exploration across various fields such as software engineering, drug discovery, scientific simulation, and more recently, general agents.
Figure 1: Failure rates of five popular multi-agent LLM systems compared to GPT-4o and Claude-3.

In this study, we define LLM-based agents as artificial entities that interact with an environment (e.g., through tool usage) based on prompt specifications (initial state), a dialogue trace (state), and the ability to act on that environment. We then define Multi-Agent Systems (MAS) as collections of agents designed to interact through orchestration to achieve collective intelligence. The structure of MAS is intended to coordinate efforts, enabling task decomposition, parallelized performance, context isolation, specialized model integration, and diverse reasoning discussions.
Figure 2: Classification of MAS failure modes.

Despite the increasing adoption of MAS, their accuracy or performance improvements over single-agent frameworks or simple baselines (e.g., best-of-N sampling) on popular benchmarks remain minimal. Our empirical analysis indicates that the correctness of the state-of-the-art (SOTA) open-source MAS ChatDev may be as low as 25%, as shown in Figure 1. Furthermore, there is currently no clear consensus on how to build robust and reliable MAS. This raises a fundamental question we first need to answer: why do MAS fail?
To understand the failure modes of MAS, we conducted the first systematic evaluation of MAS execution trajectories using Grounded Theory (GT) (Glaser & Strauss, 1967). We analyzed five popular open-source MAS, employing six expert annotators to identify granular issues within 150 dialogue trajectories, each averaging over 15,000 lines of text. We define a failure as a situation where MAS fails to achieve the expected task goals. To ensure consistency in failure modes and definitions, three expert annotators independently labeled 15 trajectories, achieving inter-annotator consistency with a Cohen’s Kappa score of 0.88. Through this comprehensive analysis, we identified 14 distinct failure modes and clustered them into three main failure categories. We introduce the Multi-Agent System Failure Taxonomy (MASFT), which is the first structured failure taxonomy for MAS, as shown in Figure 2. We do not claim that MASFT covers all potential failure modes; rather, it is the first step in classifying and understanding MAS failures.
To achieve scalable automated evaluation, we introduced the LLM-as-a-judge process using OpenAI’s o1. To validate this process, we cross-validated it with three human expert annotators on 10 trajectories, ultimately achieving a Cohen’s Kappa consistency rate of 0.77.
Intuitively, better specifications and prompt strategies may alleviate MAS failures. To test this hypothesis, we implemented interventions using prompt engineering and enhanced agent topology orchestration. Our case studies on AG2 and ChatDev indicate that while these interventions brought a +14% improvement to ChatDev, they did not resolve all failure scenarios. Moreover, the improved performance remains insufficient for practical deployment.
These findings suggest that the identified failures require more complex solutions, providing a clear roadmap for future research. We have open-sourced our dataset and LLM annotator.
Although one might simply attribute these failures to the limitations of current LLMs (e.g., hallucinations, misalignment), we speculate that improvements in the underlying model capabilities are insufficient to address the complete MASFT. Instead, we argue that good MAS design requires organizational understanding—if the organizational structure is flawed, even organizations composed of experienced individuals can fail catastrophically.
Previous research on high-reliability organizations has shown that clear design principles can prevent such failures. Consistent with these theories, our findings indicate that many MAS failures stem from challenges in agent-to-agent interactions rather than limitations of individual agents. MASFT can systematically identify these failures and inform design principles for the next generation of MAS.
The contributions of this paper are as follows:
- We introduce MASFT, the first empirically-based MAS failure taxonomy, providing a structured framework for understanding and mitigating MAS failures.
- We develop a scalable LLM-as-a-judge evaluation pipeline for analyzing new MAS performance and diagnosing failure modes.
- We conduct intervention studies targeting agent specifications, dialogue management, and validation strategies. Although these interventions improved task completion rates by 14%, they failed to fully resolve MAS failures, highlighting the need for structural MAS redesign.
- We fully open-source: (1) all 150+ MAS dialogue trajectories with annotations, (2) the scalable LLM-as-a-judge evaluation pipeline and LLM annotations on 150+ trajectories, and (3) detailed expert annotations on 15 selected trajectories.
2 Related Work
2.1 Challenges of Agent Systems
The promise of agent systems has spurred research into specific agentic challenges. For example, Agent Workflow Memory addresses long-horizon web navigation by introducing workflow memory. DSPy and Agora tackle issues in communication flows, while StateFlow focuses on state control within agent workflows to enhance task-solving capabilities. While these works contribute meaningfully to specific use cases, they do not provide a comprehensive understanding of why MAS fail, nor do they propose strategies that can be broadly applied across domains. Numerous benchmarks have been proposed to evaluate agent systems. These evaluations are crucial for identifying challenges and limitations within agent systems, but they primarily take a top-down perspective, focusing on higher-level goals such as task performance, trustworthiness, safety, and privacy.
2.2 Design Principles for Agent Systems
Some studies emphasize the challenges of building robust agent systems and propose new strategies (often aimed at single-agent design) to enhance reliability. For instance, Anthropic’s blog post highlights the importance of modular components (such as prompt chaining and routing) rather than adopting overly complex frameworks. Similarly, Kapoor et al. indicate that complexity may hinder the application of agent systems in the real world. Our work extends these insights by systematically studying failure modes in MAS, providing a taxonomy that explains why MAS fail, and proposing solutions for agent system design that align with these insights.
2.3 Failure Classification in LLM Systems
Despite the growing interest in LLM agents, dedicated studies on their failure modes are surprisingly limited. In contrast to Bansal et al.'s research, which classifies challenges faced in human-agent interactions within agent systems, our contribution represents a pioneering effort to study MAS failure modes. This underscores the necessity for future research to develop robust evaluation metrics, identify common failure modes, and design mitigation strategies to enhance the reliability of MAS.
3 Research Methodology
This section describes our approach to identifying the main failure modes in MAS and establishing a structured classification of failure modes. Figure 3 outlines this workflow.
To systematically and impartially discover failure modes, we employed the Grounded Theory (GT) methodology, a qualitative research method that builds theory directly from empirical data rather than testing predefined hypotheses. The inductive nature of GT allows failure modes to emerge organically from the data. We collected and analyzed MAS execution trajectories through theoretical sampling, open coding, constant comparative analysis, memoing, and theoretical iteration.
After obtaining MAS trajectories and discussing preliminary findings, we derived an initial taxonomy from the observed failure modes. To refine the taxonomy, we conducted inter-annotator consistency studies, iteratively adjusting failure modes and categories through adding, deleting, merging, splitting, or modifying definitions until consensus was reached. This process reflects a learning approach, where the taxonomy is continuously refined until stability is achieved, measured through inter-annotator consistency via Cohen’s Kappa score. Additionally, to enable automated failure identification, we developed an LLM-based annotator and validated its reliability.
3.1 Data Collection and Analysis
To collect data, we employed theoretical sampling to ensure diversity in the MAS studied and in the task set (the MAS execution trajectories). This method guided the selection of MAS based on their objectives, organizational structures, implementation methods, and variations in underlying agent roles. For each MAS, the selected tasks represent the expected capabilities of the system rather than artificially challenging scenarios. For instance, if the system reports performance on a specific benchmark or dataset, we directly select tasks from those benchmarks. The analyzed MAS cover multiple domains and contexts, as described in Table 1 and Appendix B.

After collecting MAS trajectories, we applied open coding to analyze the traces of agent-agent and agent-environment interactions. Open coding breaks qualitative data into labeled segments, allowing annotators to create new codes and memo observations, facilitating iterative reflection and collaboration among annotators. Specifically, annotators identify the failure modes they encounter and systematically compare their newly created codes with existing codes, a process also known as constant comparative analysis in GT. This iterative process of failure mode identification and open coding continued until we reached theoretical saturation, the point at which no new insights emerge from additional data.

Through this process, annotators labeled over 150 traces across five MAS. Next, we grouped related open codes to reveal granular failure modes in the initial version of MASFT. Finally, we linked failure modes to form a classification of error categories, as shown in Figure 2. This process is represented in Figure 3 as steps 1 and 2.

After proposing the initial taxonomy, a critical question arose: how reliable is this taxonomy, and how can we find an automated method to evaluate MAS failures? To address this, we conducted inter-annotator consistency studies, in which three annotators aimed to validate, improve, and finalize the initially derived taxonomy.
3.2 Inter-Annotator Consistency Studies and Iterative Improvements
Inter-annotator studies primarily focus on validating a given test or scoring criterion, so that when multiple different annotators evaluate the same set of test cases based on the same scoring criteria, they should arrive at the same conclusions. Although we initially derived a taxonomy based on the theoretical sampling and open coding explained in the previous section, it still needed to be validated to ensure it was unambiguous.
For inter-annotator consistency, we conducted three main rounds of discussion based on the initially derived taxonomy. In the first round, we sampled five different MAS trajectories from the 150+ trajectories obtained through theoretical sampling, and three annotators evaluated these trajectories using the failure modes and definitions from the initial taxonomy. The consistency achieved by annotators in this first round was very weak, with a Cohen’s Kappa score of 0.24. The annotators then improved the taxonomy, iteratively changing it until we reached consensus on the definition of each failure mode and on whether it was present in each of the five collected trajectories. During these iterative improvements, we modified the definitions of failure modes as needed, broke failure modes down into multiple more granular ones, merged different failure modes into new ones, added new failure modes, or removed failure modes from the taxonomy.
This process can be likened to a collaborative learning study, where different agents (in this case human annotators) independently collect observations from a shared state space and share their findings with each other to reach consensus. Furthermore, to avoid the fallacy of testing on the same data used to refine the taxonomy, after refining it at the end of round 1 we evaluated inter-annotator consistency with the new taxonomy on a different set of trajectories in round 2. In round 2, we sampled another set of five trajectories, each from a different MAS, and the annotators achieved strong consistency on the first attempt, with an average Cohen’s Kappa score of 0.92. Encouraged by this, we proceeded to round 3, where we sampled yet another set of five trajectories and again applied the same final taxonomy, achieving an average Cohen’s Kappa score of 0.84. Note that Cohen’s Kappa scores above 0.8 are considered strong, and scores above 0.9 are regarded as nearly perfect agreement.
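For readers who want to reproduce this kind of agreement measurement, the sketch below shows one way to compute Cohen's Kappa between two annotators over per-trajectory, per-failure-mode labels. It is an illustrative sketch with invented labels, not the authors' evaluation code.

```python
# Minimal sketch: Cohen's Kappa between two annotators, assuming binary
# per-(trajectory, failure-mode) labels. The label vectors below are invented.
from sklearn.metrics import cohen_kappa_score

# Each position is one (trajectory, failure mode) decision: 1 = mode present, 0 = absent.
annotator_a = [1, 0, 1, 0, 1, 1, 0, 0]
annotator_b = [1, 0, 1, 1, 1, 1, 0, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")  # above 0.8 is commonly read as strong agreement
```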
Inspired by the reliability of the taxonomy, we posed the following question: Can we devise an automated method to annotate trajectories so that developers or users can utilize this automated pipeline alongside our taxonomy to understand the reasons for their model failures? Therefore, we developed an automated MASFT annotator using the LLM-as-a-judge pipeline, which we will describe in section 3.3.
3.3 LLM Annotator
After developing our taxonomy MASFT and completing the inter-annotator consistency studies, our goal was to devise an automated method for using the taxonomy to discover and diagnose failure modes in MAS trajectories. To this end, we developed an LLM-as-a-judge pipeline: we provide the LLM with a system prompt that includes the failure modes from MASFT, their detailed definitions, and several examples of each. We used OpenAI’s o1 model and tested the pipeline both with and without these in-context examples. Using the trajectories from the third round of inter-annotator consistency studies described in section 3.2, we measured the LLM annotator’s performance, as shown in Table 2. Achieving 94% accuracy and a Cohen’s Kappa of 0.77, we consider the LLM annotator (with in-context examples) to be reliable. Encouraged by this result, we had the LLM annotator label the remaining traces in our corpus of 150+ traces; the results are shown in Figure 4, and the final taxonomy with the distribution of failure modes is illustrated in Figure 2.
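As a rough illustration of what such a pipeline can look like, the sketch below sends a trajectory plus an abbreviated taxonomy to a judge model and asks for the failure modes it detects. The prompt text, output format, and helper name are our own assumptions, not the paper's released code.

```python
# Sketch of an LLM-as-a-judge annotator (assumed prompt and output format,
# not the authors' released pipeline). Requires the `openai` package and an API key.
import json
from openai import OpenAI

client = OpenAI()

TAXONOMY = """\
FC1 Specification and system design failures (e.g., FM1.2 non-compliance with role specification)
FC2 Misalignment among agents (e.g., FM2.2 failure to seek clarification)
FC3 Task validation and termination (e.g., weak or missing validation)
"""  # abbreviated; a real prompt would include all 14 definitions plus examples

def annotate_trajectory(trace_text: str) -> dict:
    """Ask the judge model which failure modes appear in one MAS trajectory."""
    response = client.chat.completions.create(
        model="o1",
        messages=[
            {"role": "user", "content": (
                "You are given a taxonomy of multi-agent system failure modes:\n"
                f"{TAXONOMY}\n"
                "Read the following execution trace and return only a JSON object "
                "mapping each failure mode ID to true or false.\n\nTRACE:\n" + trace_text
            )},
        ],
    )
    return json.loads(response.choices[0].message.content)
```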
4 Research Findings
Figure 4: Distribution of failure modes by category and system.
Our grounded theory research and inter-annotator consistency studies, conducted across a range of different MAS, led to the development of MASFT as shown in Figure 2. MASFT organizes failures into three overall categories and identifies 14 granular failure modes that MAS may encounter during execution. MASFT also divides MAS execution into three agent-related stages, pre-execution, execution, and post-execution, and indicates the stage at which each granular failure mode may occur.
FC1. Specification and System Design Failures. Failures arising from defects in the system architecture design, poor dialogue management, unclear or violated task specifications, and insufficient definition of (or adherence to) agent roles and responsibilities. Even when clear specifications are provided, a MAS may still fail to align with user inputs. One example of such a failure is violating task specifications: when asked to design a two-player chess game that takes classic chess notation (e.g., “Ke8”, “Qd4”) as input, the MAS framework ChatDev generates a game that instead takes (x1, y1), (x2, y2) as input, representing a piece’s initial and final coordinates on the board, thus failing to meet the original requirement. Another failure pattern of this type is non-compliance with role specifications. For instance, during the requirements analysis phase of ChatDev, the CPO agent occasionally assumes the role of the CEO by unilaterally defining the product vision and making final decisions.

FC2. Misalignment Among Agents. Failures arising from poor communication, ineffective collaboration, conflicting behaviors among agents, and gradual deviation from the initial task.

Figure 5: The telephone agent fails to communicate the API specifications and login username requirements to the supervisor. On the other end of the dialogue, the supervisor agent likewise fails to clarify the login details. After several back-and-forth attempts, the supervisor agent marks the task as failed.

Multi-agent systems often suffer from inefficient dialogue, where agents engage in ineffective communication, consuming computational resources without making meaningful progress. For example, in a ChatDev trace involving the creation of a Wordle-like game, the programmer agent interacted with multiple roles (CTO, CCO, etc.) over seven cycles but never updated the initial code. The resulting game is playable but lacks robustness, with only five simple words, undermining replayability and wasting the additional communication rounds. Another failure pattern in this category is agents withholding valuable information. For instance, in Figure 5, the supervisor agent instructs the telephone agent to use the email ID as the username to retrieve contact information; the telephone agent, despite reading the documentation and discovering that the correct username should be the phone number, continues to operate with the incorrect credentials, leading to an error.

FC3. Task Validation and Termination. Failures arising from premature termination of execution and the lack of sufficient mechanisms to ensure the accuracy, completeness, and reliability of interactions, decisions, and outcomes. A MAS may lack dedicated validation steps altogether, or may include validation agents that cannot effectively perform their job. For example, in the ChatDev scenario involving the implementation of a chess game, the validation agent only checks whether the code compiles, without running the program or ensuring compliance with chess rules. Chess is a mature game with extensive specifications, rules, and implementations readily available online; even simple retrieval should intuitively prevent trivial failures, such as accepting incorrectly formatted inputs. Yet without proper validation, defects such as invalid input handling or interface format errors persist, rendering the game unplayable.
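To make the structure of MASFT concrete, the sketch below encodes the three failure categories and a few of the failure modes named in the text as plain Python data. The stage assignments and any FM identifiers not spelled out in the text are placeholders, not the paper's official numbering.

```python
# Minimal sketch of the MASFT structure as data. Only modes named in the text are
# listed (MASFT has 14 in total); stages and some IDs below are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureMode:
    fm_id: str          # e.g., "FM1.2"
    category: str       # "FC1", "FC2", or "FC3"
    name: str
    stage: str          # "pre-execution", "execution", or "post-execution"

MASFT_SUBSET = [
    FailureMode("FM1.x", "FC1", "Violating task specifications", "execution"),        # ID placeholder
    FailureMode("FM1.2", "FC1", "Non-compliance with role specifications", "execution"),
    FailureMode("FM2.2", "FC2", "Failure to seek clarification", "execution"),
    FailureMode("FM3.x", "FC3", "Weak or missing validation", "post-execution"),       # ID placeholder
]

def modes_in_category(category: str) -> list[FailureMode]:
    """Return the (subset of) failure modes belonging to one failure category."""
    return [m for m in MASFT_SUBSET if m.category == category]
```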
Figure 4 shows the distribution of granular failure modes and failure categories in the studied MAS. Different colors represent different failure categories in MASFT, while different shades represent different granular failure modes within the categories. We emphasize that no single error category dominates, indicating diversity in failure occurrences and the robustness of the taxonomy used for classification. Additionally, we note that, as expected, different MAS exhibit different distributions of failure categories and modes. For example, compared to specification and validation issues, AG2 has fewer instances of misalignment among agents, while ChatDev encounters fewer validation issues than specification and misalignment challenges. These differences stem from varying problem settings, which affect system topology design, communication protocols, and interaction management. In turn, these factors shape systems with their own strengths and weaknesses.
Figure 6: Correlation matrix of MAS failure categories.
Figure 6 highlights the correlations between different failure categories in MASFT. The observed correlations are not particularly strong, indicating that the proposed taxonomy is a reasonable classification framework. Furthermore, this suggests that failures are not isolated events; rather, they may have cascading effects that can impact other failure categories.
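As an illustration of how such a correlation matrix can be produced, the sketch below builds a binary trajectory-by-category table and computes pairwise correlations with pandas; the data values are invented for the example and do not reproduce Figure 6.

```python
# Sketch: correlating failure categories across trajectories (invented data).
import pandas as pd

# One row per trajectory; 1 means at least one failure in that category was observed.
df = pd.DataFrame(
    {
        "FC1_specification": [1, 0, 1, 1, 0, 1],
        "FC2_misalignment":  [0, 0, 1, 1, 0, 1],
        "FC3_validation":    [1, 0, 0, 1, 0, 1],
    }
)

# Pearson correlation between category indicators, analogous in spirit to Figure 6.
print(df.corr())
```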
4.3 Is It All the Validator’s Fault?
We have identified a range of failure modes in MAS. However, it can be argued that ultimately, each failure may stem from a lack of proper validation or incorrect validation processes. If we assume that the validation agent functions perfectly, then all failures are detectable and thus avoidable.
In our study, we focused on validation issues in situations where the system could effectively benefit from the results of the validation process. However, we also examined other failure modes that occurred prior to the final validation step. In many cases, we can view validation as the last line of defense against failures. This leads us to conclude that while many issues can indeed be traced back to insufficient validation, not every problem can be entirely attributed to this factor. Other factors, such as poor specifications, inadequate design, and inefficient communication, can also lead to failures. Therefore, a comprehensive approach to understanding and addressing MAS failures must consider a broader range of factors beyond just validation defects.
4.4 MASFT Failure Modes Violate HRO Defining Characteristics
While we encountered some common LLM failure modes, such as text repetition, we excluded them from MASFT because these issues are not specific to MAS and can occur even in single-LLM pipelines. On the other hand, we found evidence that MAS face issues similar to those of complex human organizations, as the failure modes align with failure patterns commonly observed in human organizations. Roberts & Rousseau (1989) identified eight key characteristics shared by high-reliability organizations (HROs). MASFT, derived through GT without any prior bias toward these theories, nonetheless contains several failure modes that relate to the defining characteristics identified by Roberts & Rousseau. Specifically, FM1.2: Non-compliance with role specifications indicates that agents attempt to exceed their roles, violating the HRO characteristic of “extreme hierarchical differentiation.” Similarly, FM2.2: Failure to Seek Clarification undermines “respect for expertise.” The failure modes identified in MASFT directly violate HRO characteristics, validating the applicability of MASFT and motivating non-trivial, HRO-inspired interventions. For example, to prevent FM1.2: Non-compliance with role specifications in MAS, orchestration and role assignment can enforce hierarchical differentiation.
5 Towards Better Multi-Agent LLM Systems
In this section, we discuss several strategies to make MAS more resilient to failures. We categorize these strategies into two main classes: (i) tactical approaches and (ii) structural strategies. Tactical approaches involve direct modifications tailored to specific failure modes, such as improving prompts, agent network topology, and dialogue management. In section 6, we experiment with these methods in two case studies and demonstrate that the effectiveness of these methods is inconsistent. This prompts us to consider the second class of strategies, which are more comprehensive approaches with system-wide impacts: strong validation, enhanced communication protocols, uncertainty quantification, and memory and state management. These strategies require deeper research and careful implementation and remain open research topics for future exploration. Table 3 outlines the mapping of the different solution strategies we propose to the failure categories.
5.1 Tactical Approaches
This category includes strategies for improving prompts and for optimizing the organization of agents and their interactions. Prompts for MAS agents should describe instructions clearly and explicitly specify each agent’s role; they can also clarify roles and tasks while encouraging proactive dialogue. If inconsistencies arise, agents can re-engage or retry, as illustrated in the prompt below. After a complex multi-step task, adding a self-verification step to the prompt, in which the agent retraces its reasoning by restating the solution, checking conditions, and testing for errors, can also help, although such self-checks may overlook defects, rely on ambiguous conditions, or be impractical. Additionally, defining dialogue patterns and setting termination conditions can reinforce clear role specifications. A modular approach using simple, well-defined agents (rather than complex, multi-task agents) can improve performance and simplify debugging. Group dynamics also open up other interesting possibilities for multi-agent systems: different agents can propose different solutions, simulating an academic peer-review process to catch deeper inconsistencies.
Prompt: Your role is to critically evaluate the solutions proposed by other agents step by step and provide the final solution.
1. **Solution Requirements**: Before making any decisions, ensure you have received solutions from the agent code executor and the agent problem solver. If any suggested solutions are missing, do not draw any conclusions; instead, suggest the next speaker, stating: Suggested next speaker: _suggested agent name_.
2. **Avoid Assumptions**: Pay attention to the variables provided in the original problem statement versus the variables assumed by the agents. **Assumed values are invalid for the solution** and may lead to inaccuracies. Never base solutions on assumed values; always base solutions on clearly given variables to ensure correctness. If the problem cannot be solved due to lack of information, return: **SOLUTION_FOUND \boxed{'None'}**.
3. **Evaluate Conflicting Solutions**: If different answers arise during the discussion, choose the most appropriate solution based on your evidence or engage in further discussion to clarify.
4. **Final Solution Statement**: When you are confident in the final solution, return it as follows: **SOLUTION_FOUND \boxed{_solution_value_here_}**. Ensure that only the value is placed inside \boxed{}; any accompanying text should be outside.
Another set of cross-checking strategies involves making multiple LLM calls and taking a majority vote, or resampling until verification passes. However, these seemingly simple fixes often prove inconsistent, echoing the results of our case studies. This underscores the need for the more robust structural strategies discussed in the next section.
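A minimal sketch of the majority-vote idea is shown below, assuming a hypothetical `ask_llm` callable that returns one candidate answer per call; it only illustrates the tactic and does not correspond to any specific framework's implementation.

```python
# Sketch: majority voting over repeated LLM calls.
# `ask_llm` is a hypothetical helper that queries a model once and returns an answer string.
from collections import Counter

def majority_vote(question: str, ask_llm, n_samples: int = 5) -> str:
    answers = [ask_llm(question) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    if count <= n_samples // 2:
        # No strict majority: a real system might resample or escalate to a verifier here.
        pass
    return answer
```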
5.2 Structural Strategies
In addition to the tactical approaches discussed above, more comprehensive solutions that reshape the structure of the MAS itself are needed. We first observe the critical role of validation processes and validation agents in multi-agent systems. Our annotations indicate that weak or insufficient validation mechanisms are significant contributors to system failures. While unit-test generation aids validation in software engineering, creating a universal validation mechanism remains challenging. Even in coding, covering all edge cases is difficult, even for experts. Validation also varies by domain: coding requires comprehensive test coverage, QA requires certified data checks, and reasoning benefits from symbolic validation. Adapting validation across domains remains an open research challenge.
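The sketch below illustrates one simple form of such verification for coding tasks: running generated code against unit tests in a subprocess instead of merely checking that it compiles. The file layout and test command are assumptions made for the example, not part of any MAS framework discussed here.

```python
# Sketch: a validation step that executes tests rather than only checking compilation.
import subprocess
import tempfile
from pathlib import Path

def verify_generated_code(code: str, test_code: str) -> bool:
    """Write the generated module and its tests to a temp dir and run pytest."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(code)
        Path(tmp, "test_solution.py").write_text(test_code)
        result = subprocess.run(
            ["python", "-m", "pytest", "-q", tmp],
            capture_output=True,
            text=True,
            timeout=60,
        )
        return result.returncode == 0  # 0 means all tests passed
```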
A complementary strategy to validation is establishing standardized communication protocols. LLM-based agents primarily communicate through unstructured text, which leads to ambiguity. Clearly defining intents and parameters can improve consistency and enable formal consistency checks during and after interactions. Multi-agent graph attention mechanisms use graph attention over agent interactions to enhance coordination; similarly, attention-based communication allows agents to selectively focus on relevant information. Additionally, learned, selective communication protocols can improve collaborative efficiency.
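The sketch below shows what a standardized, structured inter-agent message could look like: a dataclass with an explicit intent and typed parameters plus a simple consistency check before delivery. The field names and allowed intents are illustrative assumptions, not a published protocol.

```python
# Sketch: a structured inter-agent message with explicit intent and parameters,
# replacing free-form text. Field names and intents are illustrative only.
from dataclasses import dataclass, field

ALLOWED_INTENTS = {"request", "inform", "clarify", "terminate"}

@dataclass
class AgentMessage:
    sender: str
    receiver: str
    intent: str                      # must be one of ALLOWED_INTENTS
    parameters: dict = field(default_factory=dict)

    def validate(self) -> None:
        """Lightweight formal check performed before the message is delivered."""
        if self.intent not in ALLOWED_INTENTS:
            raise ValueError(f"Unknown intent: {self.intent}")
        if self.intent == "request" and "task" not in self.parameters:
            raise ValueError("A 'request' message must name the task it asks for.")

msg = AgentMessage("supervisor", "telephone_agent", "request",
                   {"task": "retrieve_contacts", "username_field": "phone_number"})
msg.validate()
```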
Another important research direction is to fine-tune MAS agents using reinforcement learning. Agents can be trained with role-specific algorithms that reward task-consistent actions and penalize inefficient behaviors. MAPPO optimizes agents’ adherence to defined roles. Similarly, SHPPO uses latent networks to learn strategies before applying heterogeneous decision layers. Optima further enhances communication efficiency and task effectiveness through iterative reinforcement learning.
On the other hand, incorporating probability confidence measures into agent interactions can significantly improve decision-making and communication reliability. Drawing inspiration from the framework proposed by Horvitz et al., agents can be designed to act only when their confidence exceeds a predefined threshold. Conversely, when confidence is low, agents can pause to gather more information. Additionally, systems can benefit from adaptive thresholds, where confidence thresholds are dynamically adjusted.
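A minimal sketch of confidence-gated action is shown below, assuming the agent can attach a scalar confidence to each proposed action; the threshold values and the adaptive update rule are illustrative choices, not taken from the paper.

```python
# Sketch: act only above a confidence threshold; otherwise pause to gather information.
# Threshold values and the adaptation rule are illustrative assumptions.

class ConfidenceGatedAgent:
    def __init__(self, threshold: float = 0.75):
        self.threshold = threshold

    def step(self, proposed_action: str, confidence: float) -> str:
        if confidence >= self.threshold:
            return f"EXECUTE: {proposed_action}"
        # Low confidence: pause and request more information instead of acting.
        return "CLARIFY: request additional context before acting"

    def adapt_threshold(self, recent_error_rate: float) -> None:
        """Raise the bar after errors, relax it when the agent has been reliable."""
        if recent_error_rate > 0.2:
            self.threshold = min(0.95, self.threshold + 0.05)
        else:
            self.threshold = max(0.5, self.threshold - 0.01)
```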
Although memory and state management are often viewed as single-agent attributes, they are crucial for multi-agent interactions, enhancing contextual understanding and reducing ambiguity in communication. However, most research has focused on single-agent systems. MemGPT introduces context management inspired by operating systems to extend context windows, while TapeAgents use structured, replayable logs (“tapes”) to iteratively record and improve agent operations, facilitating dynamic task decomposition and continuous improvement.
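As a toy illustration of the "tape" idea, a structured, replayable log of agent steps, the sketch below keeps an append-only list of step records that can be replayed or serialized; it is a simplification for illustration and not the TapeAgents API.

```python
# Sketch: an append-only, replayable log of agent steps ("tape"-style), simplified.
from dataclasses import dataclass, asdict
import json

@dataclass
class Step:
    agent: str
    kind: str      # e.g., "thought", "action", "observation"
    content: str

class Tape:
    def __init__(self):
        self._steps: list[Step] = []

    def append(self, step: Step) -> None:
        self._steps.append(step)

    def replay(self):
        """Yield steps in order, e.g., to rebuild an agent's context after a restart."""
        yield from self._steps

    def to_json(self) -> str:
        return json.dumps([asdict(s) for s in self._steps], indent=2)
```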
6 Case Studies
In this section, we present two case studies applying some tactical approaches.
6.1 Case Study 1: AG2 – MathChat
In this case study, we implement the MathChat scenario in AG2 as our baseline, where a student agent collaborates with an assistant agent capable of executing Python code to solve problems. For benchmarking, we randomly selected 200 exercises from the GSM-Plus dataset, an augmented version of GSM8K with various adversarial perturbations added.

The first strategy was to improve the original prompt, clarifying its structure and adding a dedicated section on verification. Detailed prompts are provided in Appendices E.1 and E.2. The second strategy was to refine the agent configuration into a more specialized system with three distinct roles: a problem solver, which uses a chain-of-thought approach to solve problems without tools; a coder, which writes and executes Python code to arrive at the final answer; and a validator, responsible for reviewing the discussion and critically evaluating the solutions, either confirming the answer or prompting further debate. In this configuration, only the validator could terminate the dialogue once a solution was found.

To evaluate the effectiveness of these strategies, we ran benchmarking experiments with two different LLMs (GPT-4 and GPT-4o) across three configurations (baseline, improved prompts, and new topology). We also repeated each experiment six times to assess the consistency of the results. Table 4 summarizes the results.
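The sketch below expresses this refined three-role configuration as framework-agnostic Python data (roles, their capabilities, and who may terminate); the field names are our own and do not correspond to AG2's actual configuration API.

```python
# Sketch: the refined three-role topology as framework-agnostic configuration.
# Field names are illustrative assumptions, not AG2's real API.
AGENT_ROLES = {
    "problem_solver": {
        "description": "Solves the problem with chain-of-thought reasoning, no tools.",
        "tools": [],
        "may_terminate": False,
    },
    "coder": {
        "description": "Writes and executes Python code to reach the final answer.",
        "tools": ["python_executor"],
        "may_terminate": False,
    },
    "validator": {
        "description": "Reviews the discussion, critiques both solutions, and either "
                       "confirms the answer or requests further debate.",
        "tools": [],
        "may_terminate": True,   # only the validator may end the conversation
    },
}
```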
Table 4: Accuracy comparison of case studies. Under AG2, results for GSM-Plus using GPT-4 and GPT-4o; under ChatDev, results for ProgramDev and HumanEval.
The second column of Table 4 shows that with GPT-4, the improved prompt (with its added verification section) significantly outperformed the baseline, while the new topology did not yield the same improvement: the p-value from the Wilcoxon test was 0.4, indicating that its slight gain was not statistically significant. For GPT-4o (the third column of Table 4), the Wilcoxon p-values comparing the baseline against the improved prompt and the new topology were 0.03, indicating statistically significant improvements. These results suggest that optimizing prompts and clearly defining agent roles can reduce failures. However, these strategies are not universal remedies, and their effectiveness may vary depending on factors such as the underlying LLM.
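For reference, the sketch below shows how such a comparison can be run with a Wilcoxon signed-rank test over paired accuracies from repeated runs; the numbers are invented and do not reproduce the paper's results.

```python
# Sketch: paired significance test over repeated benchmark runs (invented numbers).
from scipy.stats import wilcoxon

baseline_accuracy = [0.84, 0.85, 0.83, 0.86, 0.84, 0.85]   # six repeated runs
improved_accuracy = [0.88, 0.89, 0.87, 0.90, 0.88, 0.89]

stat, p_value = wilcoxon(baseline_accuracy, improved_accuracy)
print(f"Wilcoxon p-value: {p_value:.3f}")  # p < 0.05 suggests a significant difference
```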
6.2 Case Study 2: ChatDev
ChatDev simulates a multi-agent software company in which agents with different role specifications, such as CEO, CTO, software engineers, and reviewers, attempt to collaboratively solve software-generation tasks. To address the challenges we frequently observed in its traces, we implemented two different interventions.

Our first solution was to refine the role-specific prompts to enforce hierarchy and role compliance. For example, we observed instances where the CPO prematurely ended discussions with the CEO without fully resolving constraints. To prevent this, we ensured that only the senior agent could conclude the dialogue. Additionally, we enhanced the validator’s role specification to focus on task-specific edge cases. Detailed information about these interventions can be found in Appendix F.

The second solution attempted a more fundamental change to the framework’s topology, modifying it from a directed acyclic graph (DAG) to a cyclic graph. The process now terminates only when the CTO agent confirms that all comments have been adequately addressed, with a maximum-iteration cutoff to prevent infinite loops. This approach allows iterative refinement and more comprehensive quality assurance.

We tested our interventions on two different benchmarks. The first was a custom-generated set of 32 tasks (which we call ProgramDev), where we asked the framework to generate various programs, from “write me a two-player chess game to play in the terminal” to “write me a BMI calculator.” The second benchmark was OpenAI’s HumanEval. We report our results in Table 4. Note that while our interventions improved the framework’s performance across different tasks, they did not yield substantial improvements, motivating the more comprehensive solutions outlined in section 5.2.
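The change from a DAG to a cyclic topology with an iteration cap can be sketched as a simple review loop, as below; `generate_revision` and `cto_approves` are hypothetical placeholders standing in for the framework's actual agents.

```python
# Sketch: cyclic refinement with a maximum-iteration cutoff (placeholder callables).
def iterative_refinement(task: str, generate_revision, cto_approves, max_iters: int = 5) -> str:
    """Keep revising until the CTO-style reviewer approves or the iteration cap is hit."""
    artifact = generate_revision(task, feedback=None)
    for _ in range(max_iters):
        approved, feedback = cto_approves(artifact)
        if approved:
            break                      # all comments addressed, terminate the loop
        artifact = generate_revision(task, feedback=feedback)
    return artifact
```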
7 Conclusion
In this study, we systematically investigated the failure modes of LLM-based multi-agent systems for the first time. Guided by GT theory, we collected and analyzed over 150 trajectories and iteratively refined our taxonomy through inter-annotator studies for validation. We identified 14 granular failure modes and categorized them into 3 different failure categories, providing a standard for future MAS research. We also proposed an LLM annotator as an automated method for analyzing MAS trajectories and demonstrated its effectiveness and reliability. We discussed two sets of solutions for all failure categories, namely tactical and structural strategies. After conducting case studies on some tactical strategies, our findings indicate that many of these “obvious” fixes actually have significant limitations, necessitating the structural strategies we outlined to achieve more consistent improvements.
Original link: https://arxiv.org/html/2503.13657v1