Why Do Multi-Agent LLM Systems Fail?
Abstract
Despite growing enthusiasm for multi-agent systems (MAS), in which multiple LLM agents collaborate to complete tasks, their performance gains on popular benchmarks remain minimal compared to single-agent frameworks. This gap highlights the need to analyze the challenges that hinder MAS effectiveness.
In this paper, we present the first comprehensive study of the challenges facing MAS. We analyze five popular MAS frameworks across more than 150 tasks, involving six expert human annotators. We identify 14 unique failure modes and propose a comprehensive taxonomy, the Multi-Agent System Failure Taxonomy (MASFT), applicable across MAS frameworks. The taxonomy was validated through agreement studies among three expert annotators, reaching a Cohen's Kappa of 0.88. The granular failure modes are organized into three categories: (i) specification and system design failures, (ii) misalignment among agents, and (iii) task validation and termination. To support scalable evaluation, we integrate MASFT with an LLM-as-a-judge pipeline. We also examine whether the identified failures can be easily prevented by proposing two interventions: improved specification of agent roles and enhanced orchestration strategies. Our findings indicate that the identified failures require more complex solutions, providing a clear roadmap for future research. We open-source our dataset and LLM annotations.
“Happy families are all alike; every unhappy family is unhappy in its own way.” (Tolstoy, 1878)
“Successful systems operate in the same way; every failed system has its own problems.” (Berkeley, 2025)
1. Introduction
Recently, agent systems based on large language models (LLMs) have garnered significant attention in the AI community. This growing interest stems from the ability of agent systems to handle complex, multi-step tasks while dynamically interacting with various environments, making LLM-based agent systems particularly suitable for real-world problems. Building on this characteristic, multi-agent systems are increasingly being explored across various fields, such as software engineering and drug discovery.
Although the formal definition of agents remains a topic of debate, in this study we define LLM-based agents as artificial entities with prompt specifications (initial state), dialogue trajectories (state), and the ability to interact with the environment (e.g., tool usage). Multi-agent systems (MAS) are then defined as sets of agents designed to interact through orchestration to achieve collective intelligence. MAS are structured to coordinate effort, enabling task decomposition, performance parallelization, context isolation, specialized model ensembling, and diverse reasoning discussions.
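To make these definitions concrete, the following is a minimal sketch of the abstractions they imply. It is illustrative only: the class names and the round-robin orchestration rule are our own assumptions, not any specific framework's API. An agent carries a prompt specification, accumulates a dialogue trajectory, and may expose tools; a MAS couples several agents through an orchestration rule.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    """An LLM-based agent: prompt specification (initial state),
    dialogue trajectory (state), and tools for acting on the environment."""
    name: str
    system_prompt: str
    trajectory: list[str] = field(default_factory=list)
    tools: dict[str, Callable[..., str]] = field(default_factory=dict)

    def act(self, message: str) -> str:
        # Stub: a real agent would query an LLM conditioned on
        # system_prompt + trajectory, possibly invoking self.tools.
        self.trajectory.append(f"user: {message}")
        reply = f"[{self.name}] handled: {message}"
        self.trajectory.append(f"assistant: {reply}")
        return reply

@dataclass
class MultiAgentSystem:
    """A set of agents interacting under an orchestration rule."""
    agents: list[Agent]

    def run(self, task: str, rounds: int = 3) -> str:
        message = task
        for _ in range(rounds):   # round-robin; one of many possible topologies
            for agent in self.agents:
                message = agent.act(message)
        return message
```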
Our research on multi-agent systems reveals various failure modes. These modes are organized into three categories: system design failures, misalignment among agents, and task validation failures. Specifically, system design failures include failure to adhere to task specifications or role specifications, as well as issues such as step repetition. Misalignment among agents manifests as dialogue resets, failure to seek clarification, task deviation, information concealment, ignoring inputs from other agents, and reasoning-action mismatches. Finally, task validation failures include premature termination, failure to conduct or incomplete validation, and incorrect validation.
Figure 2 illustrates the taxonomy of failure modes in multi-agent systems. The inter-agent conversation stages indicate when each failure can occur in the end-to-end MAS; a failure mode that spans multiple stages may occur at any of them. The percentages report the frequency of each failure mode and category across the 151 analyzed trajectories.
Despite the increasing popularity of multi-agent systems (MAS), their gains in accuracy or performance over single-agent frameworks, or even over simple best-of-N sampling strategies, remain negligible on popular benchmarks. Our empirical analysis shows that the correctness of ChatDev, a state-of-the-art open-source MAS, can be as low as 25%. Furthermore, there is no clear consensus on how to build robust and reliable MAS, which raises a fundamental question we must answer first: why do MAS fail?
To understand the failure modes of multi-agent systems (MAS), we conducted the first systematic evaluation of MAS execution trajectories using grounded theory. We analyzed five popular open-source MAS and used six expert annotators to identify granular issues in 150 dialogue trajectories, each with an average of over 15,000 lines of text. We define a failure as a situation where MAS fails to achieve the expected task goals. To ensure consistency in failure modes and definitions, three expert annotators independently annotated 15 trajectories, achieving a Cohen’s Kappa score of 0.88, indicating a high level of agreement among annotators. Through this comprehensive analysis, we identified 14 different failure modes and clustered them into three main failure categories. We introduced the Multi-Agent System Failure Taxonomy (MASFT), which is the first structured failure taxonomy for MAS. We do not claim that MASFT covers all potential failure modes; rather, it serves as a first step in building a failure taxonomy for MAS and understanding MAS failures.
To achieve scalable automated evaluation, we introduce an LLM-as-a-judge pipeline built on OpenAI's o1. To validate this pipeline, we cross-validated its annotations against those of three human experts on 10 trajectories, achieving a Cohen's Kappa agreement of 0.77.
Intuitively, better specifications and prompting strategies might alleviate MAS failures. To test this hypothesis, we implemented interventions using prompt engineering and enhanced agent orchestration. Our case studies on AG2 and ChatDev show that while these interventions yielded a +14% improvement for ChatDev, they did not resolve all failure cases, and the improved performance remains insufficient for practical deployment.
These findings suggest that MASFT is not a mere artifact of existing multi-agent frameworks but rather a reflection of inherent flaws in multi-agent design. To build robust and reliable multi-agent systems, MASFT serves as a framework guiding future research, outlining potential solutions for the 14 identified failure modes. Additionally, we open-source our annotations for further multi-agent system research.
While these failures could simply be attributed to current limitations of LLMs, such as hallucination and misalignment, we speculate that improvements in the capabilities of underlying models alone are insufficient to address all of MASFT. Instead, we believe that good MAS design requires organizational understanding: even organizations staffed with highly capable individuals can fail catastrophically if their organizational structure is flawed. Previous research on high-reliability organizations indicates that clearly defined principles can prevent such failures. Consistent with these theories, our findings suggest that many MAS failures stem from challenges in inter-agent interactions rather than from limitations of individual agents. MASFT can systematically identify these failures and inform the design principles for the next generation of MAS. The contributions of this paper are as follows:
• We introduce MASFT, the first empirically grounded failure taxonomy for MAS, providing a structured framework for understanding and mitigating MAS failures.
• We develop a scalable LLM-as-a-judge evaluation pipeline for analyzing new MAS performance and diagnosing failure modes.
• We conduct best-effort intervention studies targeting agent specifications, conversation management, and validation strategies. Despite achieving a 14% improvement in task completion, these interventions fail to fully resolve MAS failures, highlighting the need for a structural redesign of MAS.
• We fully open-source: (1) all 150+ annotated MAS dialogue trajectories, (2) the scalable LLM-as-a-judge evaluation pipeline together with its LLM annotations on the 150+ trajectories, and (3) detailed expert annotations on 15 selected trajectories.
2. Related Work
2.1. Challenges Faced in Agent Systems
The promising capabilities of agent systems have inspired research into specific agent challenges. For example, Agent Workflow Memory (Wang et al., 2024e) addresses long-horizon navigation by introducing workflow memory. DSPy (Khattab et al., 2023) and Agora (Wang et al., 2024e) tackle issues in communication processes, while StateFlow (Wu et al., 2024b) focuses on state control in agent workflows to enhance task-solving. While these works contribute meaningfully to specific use cases, they neither provide a comprehensive understanding of why MAS fail nor propose strategies that apply broadly across domains. Numerous benchmarks have been proposed to evaluate agent systems (Jimenez et al., 2024; Peng et al., 2024; Wang et al., 2024c; Anne et al., 2024; Bettini et al., 2024; Long et al., 2024). These evaluations are crucial for identifying challenges and limitations in agent systems, but they mainly adopt a top-down perspective, focusing on high-level goals such as task performance, trustworthiness, safety, and privacy (Liu et al., 2023; Yao et al., 2024b).
2.2. Design Principles for Agent Systems
Several studies emphasize the challenges of building robust agent systems and propose new strategies, often used in single-agent design, to enhance reliability. For instance, a blog post by Anthropic (Anthropic, 2024a) highlights the importance of modular components, such as prompt chains, rather than adopting overly complex frameworks. Similarly, Kapoor et al. (2024) indicate that complexity may hinder the application of agent systems in the real world. Our work extends these insights by systematically studying failure modes in MAS, providing a taxonomy that elucidates the reasons for MAS failures and proposes agent system design solutions consistent with these insights.
2.3. Failure Taxonomy in LLM Systems
Despite the growing interest in LLM agents, dedicated studies of their failure modes remain surprisingly scarce. Beyond the catalog of human-agent interaction challenges compiled by Bansal et al. (2024), our contribution represents a pioneering effort to study failure modes in MAS. This underscores the need for future research to develop robust evaluation metrics, identify common failure modes, and design mitigation strategies that enhance MAS reliability.
3. Research Methodology
This section describes our approach to identifying the main failure modes in multi-agent systems (MAS) and establishing a structured failure mode taxonomy. Figure 3 provides an overview of this workflow.
To systematically discover unbiased failure modes, we adopted a grounded theory (GT) approach (Glaser & Strauss, 1967), a qualitative research method that builds theory directly from empirical data rather than testing predefined hypotheses. The inductive nature of GT allows for the organic emergence of failure mode identification. We employed theoretical sampling, open coding, constant comparative analysis, memoing, and theorizing in an iterative process to collect and analyze MAS execution trajectories.
After collecting MAS trajectory data and discussing preliminary findings, we derived an initial taxonomy by gathering observed failure modes. To refine the taxonomy, we conducted an inter-annotator consistency study by adding and iteratively adjusting failure modes and failure categories.
Figure 3 outlines our research methodology workflow: identifying failure modes, developing the taxonomy, and refining it through inter-annotator agreement studies that reached a Cohen's Kappa of 0.88.
Table 1 lists the MAS systems we studied, each containing at least 30 manually annotated trajectories. Detailed information and additional system information can be found in Appendix B. These systems include MetaGPT, ChatDev, HyperAgent, AppWorld, and AG2. They employ different architectures, such as hierarchical workflows and star topologies, and serve various purposes, such as simulating standard operating procedures in software development.
To ensure the reliability of the taxonomy, we conducted an inter-annotator consistency study. Three expert annotators independently annotated 15 trajectories to verify, refine, and finalize the initially proposed taxonomy. Inter-annotator consistency was measured using the Cohen’s Kappa coefficient, ultimately reaching 0.88, indicating a high level of agreement among annotators.
This process involved iteratively adding and adjusting failure modes and categories until consensus was reached, continuously refining the taxonomy until it stabilized. To enable automated failure identification, we then developed an LLM-based annotation pipeline and validated its reliability.
We employed a theoretical sampling method to ensure the diversity of the analyzed multi-agent systems (MAS). We collected over 150 trajectories from five different MAS, each with distinct goals, organizational structures, implementation methods, and underlying agent roles.
In analyzing these trajectories, we used open coding methods to break down qualitative data into labeled segments and create new codes to document observations. Annotators identified encountered failure modes and compared them with existing codes to ensure the continuous refinement of the taxonomy. This process continued until reaching the theoretical saturation point, where no new insights emerged from additional data.
Through memoing, annotators were able to engage in iterative reflection and collaboration, systematically refining the taxonomy. We ultimately grouped relevant open codes, revealing the initial version of the Multi-Agent System Failure Taxonomy (MASFT) and organizing it into the error categories shown in Figure 2.
To ensure the taxonomy is unambiguous, we conducted a series of inter-annotator agreement rounds. First, we sampled five trajectories from different MAS among the 150+ collected and had three annotators label them using the failure modes and definitions of the initial taxonomy. This first round showed weak agreement, with a Cohen's Kappa of 0.24. The annotators then refined the taxonomy by revising failure-mode definitions, splitting modes into more granular ones, merging overlapping modes, and adding or removing modes as needed.
This process resembled an iterative learning study, in which annotators independently collected observations from a shared state space and then shared findings to reach consensus. To avoid contamination between the trajectories used for refinement and those used for evaluation, we sampled another set of five trajectories to evaluate the refined taxonomy. Inter-annotator agreement improved substantially, with an average Cohen's Kappa of 0.92. In a third round, we sampled yet another five trajectories and annotated them with the final taxonomy, achieving an average Cohen's Kappa of 0.84. Cohen's Kappa scores above 0.8 are considered strong agreement, and scores above 0.9 near-perfect alignment.
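For concreteness, the agreement numbers above can be reproduced mechanically: each annotator emits one binary label per failure mode per trajectory, and Cohen's Kappa corrects the raw agreement rate for chance. A toy illustration with hypothetical labels, using scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary annotations (1 = failure mode present) from two
# annotators, flattened over failure modes and trajectories.
annotator_a = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")  # ~0.85 for this toy example
```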
Encouraged by the reliability of the taxonomy, we asked: can trajectories be annotated automatically, so that developers and users can apply this process to understand why their systems fail? We therefore developed an LLM-based automatic MASFT annotation method, described in detail in Section 3.3.
Table 2 reports the performance of the LLM judge. We experimented with two settings: zero-shot (o1) and few-shot (o1 with in-context examples). With examples provided, the LLM annotator achieved high accuracy (0.89) and a Cohen's Kappa of 0.77, indicating that it is a reliable annotator. Encouraged by this result, we used the pipeline to analyze additional MAS trajectories.
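As a sketch of what such a judge pipeline can look like (the prompt wording, the truncated failure-mode list, and the `call_llm` helper are hypothetical stand-ins, not the paper's exact implementation), the judge receives the failure-mode definitions, optional few-shot examples, and a trajectory, and returns one binary verdict per mode:

```python
import json

# Truncated list; the full pipeline covers all 14 MASFT failure modes.
FAILURE_MODES = [
    "FM1.1: Violation of Task Specifications",
    "FM1.2: Violation of Role Specifications",
    "FM2.2: Failure to Seek Clarification",
]

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an o1-class model API."""
    raise NotImplementedError

def judge_trajectory(trajectory: str, few_shot_examples: str = "") -> dict[str, bool]:
    """Ask the LLM judge for one binary verdict per failure mode.
    Providing few-shot examples raised agreement with human experts (Table 2)."""
    prompt = (
        "You are annotating a multi-agent trace for failure modes.\n"
        f"Failure mode definitions: {FAILURE_MODES}\n"
        f"{few_shot_examples}\n"
        f"Trace:\n{trajectory}\n"
        'Respond with JSON: {"<failure mode>": true or false, ...}'
    )
    return json.loads(call_llm(prompt))
```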
Figure 4 illustrates the distribution of failure modes by category and system. Our grounded theory research and inter-rater consistency study contributed to the development of the Multi-Agent System Failure Taxonomy (MASFT) presented in Figure 2. MASFT organizes failures into three overarching categories and identifies 14 granular failure modes that multi-agent systems may encounter during execution. MASFT also divides multi-agent system execution into three agent-related phases: pre-execution, execution, and post-execution, thereby identifying the execution phase in which each granular failure mode may occur.
4.1. Failure Categories
In this section, we briefly describe the overall failure categories (FC) in MASFT. Appendix A provides detailed definitions of the 14 granular failure modes in MASFT. Additionally, Appendix D provides detailed examples of each granular failure mode in MASFT.
FC1. Specification and System Design Failures. These issues stem from flaws in system architecture design, poor dialogue management, ambiguous task specifications or constraint violations, and insufficient definition or adherence to agent roles.
In multi-agent systems, task failures often arise from incomplete or ambiguous instructions. However, even when clear specifications are provided, a MAS may still deviate from them. One example of this failure type is a task specification violation: asked to generate a two-player chess game that accepts standard algebraic move notation (e.g., "Ke8", "Qd4") as input, the ChatDev framework produced a game that instead takes (x1, y1), (x2, y2) coordinate pairs for a piece's start and end squares, failing the original requirement.
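Such a requirement can often be checked mechanically. A hypothetical acceptance test for the notation requirement (the regex is a simplified subset of standard algebraic notation, ignoring captures, castling, and promotion):

```python
import re

# Simplified subset of standard algebraic notation; enough to detect
# the coordinate-pair regression described above.
ALGEBRAIC_MOVE = re.compile(r"^[KQRBN]?[a-h][1-8]$")

def accepts_algebraic(move: str) -> bool:
    return bool(ALGEBRAIC_MOVE.match(move))

assert accepts_algebraic("Qd4") and accepts_algebraic("Ke8")
assert not accepts_algebraic("(1,2),(3,4)")  # the format ChatDev actually produced
```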
Another failure mode in this category is violation of role specifications. For instance, during ChatDev's requirements-analysis phase, the CPO agent occasionally assumes the CEO's role, unilaterally defining the product vision and making final decisions.
FC2. Misalignment Among Agents. These failures stem from ineffective communication, poor collaboration, conflicting behaviors among agents, and gradual deviation from the initial task.
Multi-agent systems are often plagued by inefficient dialogues, where agents engage in unproductive exchanges, consuming computational resources without making meaningful progress. For example, in a ChatDev trajectory involving the creation of a Wordle-like game, the programmer agent interacts with multiple roles (CTO, CCO, etc.) over seven cycles but fails to update the initial code. The generated game is playable but lacks robustness, featuring only five simple words, undermining replayability and rendering additional communication rounds futile.
Another type of failure mode in this category is the concealment of valuable information by agents. For instance, in Figure 5, the supervisor agent instructs the phone agent to use the email ID as a username to retrieve contact information. Despite reading the documentation and discovering that the correct username should be the phone number, the phone agent continues to operate with incorrect credentials, leading to an error.
FC3. Task Validation and Termination. These issues stem from premature termination, failure to conduct or incomplete validation, and incorrect validation.
Figure 6 shows the correlation between the different failure categories in MASFT. While the observed correlations are not strong, they indicate that the proposed taxonomy is a reasonable classification framework. They also suggest that failures are not isolated events; rather, they can cascade into other failure categories. More detail appears in the Appendix, where Figure 7 reports the correlations between individual failure modes.
Figure 6. Correlation matrix of MAS failure categories.
4.3. Is It All the Validator's Fault?
We have identified a range of failure modes in MAS. One could argue, however, that ultimately every failure stems from missing or incorrect validation: if the validation agent functioned perfectly, all failures would be detected and thus avoided.
In our study, we flag validation issues in cases where the system could effectively have benefited from the validation step, while also recording the failure modes that arise before the final validation step. In many cases, validation is best viewed as the last line of defense against failure. We therefore conclude that while many issues can indeed be traced back to insufficient validation, not all of them can be attributed to it: poor specifications, inadequate design, and inefficient communication cause failures of their own. Fully understanding and addressing MAS failures thus requires considering factors beyond validation alone.
4.4. MASFT Failure Modes Violate HRO Defining Characteristics
Although we encountered some common LLM failure modes, such as text repetition, we exclude them from MASFT because they are not specific to MAS and can occur even in single-LLM invocation pipelines. On the other hand, we find that MAS face problems similar to those of complex human organizations: their failure modes align with failure modes commonly observed in such organizations. Roberts & Rousseau (1989) identified eight key characteristics shared by high-reliability organizations (HROs). Through grounded theory, MASFT surfaces failure modes that directly violate these defining HRO characteristics. Specifically, "FM1.2: Violation of Role Specifications" undermines the HRO characteristic of "extreme hierarchical differentiation."
Similarly, “FM2.2: Failure to Seek Clarification” undermines “respect for expertise.” The direct violation of HRO characteristics by the failure modes identified in MASFT validates the applicability of MASFT and the need for non-trivial interventions inspired by HRO. For example, to prevent “FM1.2: Violation of Role Specifications” from occurring in MAS, orchestration and role assignments can enforce hierarchical differentiation.
5. Towards Better Multi-Agent LLM Systems
In this section, we discuss some methods to enhance the fault tolerance of MAS. We categorize these strategies into two main types: (i) tactical approaches and (ii) structured strategies. Tactical approaches involve direct modifications targeting specific failure modes, such as improving prompts, agent network topology, and dialogue management. In Section 6, we experiment with these methods through two case studies and demonstrate that the effectiveness of these methods is inconsistent. This prompts us to consider the second category of strategies: more comprehensive approaches with system-wide impacts, such as strong validation, enhanced communication protocols, uncertainty modeling, and memory and state management. These strategies require deeper research and meticulous implementation and remain open research topics for future exploration. See Table 3 for the mapping of the different solution strategies we propose to the failure categories.
5.1. Tactical Approaches
This category covers strategies for improving prompts and for optimizing agent organization and interaction. Prompts for MAS agents should describe instructions clearly and explicitly specify each agent's role (see Appendix E.2 for an example) (He et al., 2024a; Talebirad & Nadiri, 2023). Prompts can also clarify roles and tasks while encouraging agents to communicate and collaborate explicitly.
Encouraging proactive dialogue. If inconsistencies arise, agents can re-engage or retry, as shown in Appendix E.5 (Chan et al., 2023). After complex, multi-step tasks, adding a self-verification step to the prompt can retrace the reasoning process by restating the solution, checking conditions, and testing for errors (Weng et al., 2023). However, self-verification may overlook defects, rely on vague conditions, or be impractical (Stoica et al., 2024b). Additionally, clear role specifications can be reinforced by defining conversation patterns and setting termination conditions (Wu et al., 2024a; LangChain, 2024). A modular approach, using simple, well-defined agents instead of complex, multi-task agents, can improve performance and simplify debugging (Anthropic, 2024b). Group dynamics offer further possibilities for multi-agent systems: different agents can propose diverse solutions (Yao et al., 2024a) and discuss their assumptions and findings (cross-verification) (Haji et al., 2024). For example, Xu et al. (2023) use a multi-agent simulation that mimics the academic peer-review process to uncover deeper inconsistencies.
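As a minimal, hypothetical illustration of these tactics, the sketch below pairs a role prompt that states an agent's authority boundaries with a programmatic termination guard, so a conversation cannot end before its exit criteria are met (the prompt text and the `DONE` convention are our assumptions):

```python
ROLE_PROMPT = """You are the CTO agent.
Scope: make technical decisions only; product vision is the CEO's role.
If the task is under-specified, ask a clarifying question before proceeding.
End your turn with DONE only once every acceptance criterion is restated and met."""

def may_terminate(transcript: list[str], criteria: list[str]) -> bool:
    """Termination guard: allow the dialogue to end only if every acceptance
    criterion has been mentioned and the last message signals completion."""
    if not transcript:
        return False
    text = "\n".join(transcript).lower()
    return (all(c.lower() in text for c in criteria)
            and transcript[-1].strip().endswith("DONE"))
```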
Another set of cross-verification tactics involves making multiple LLM calls and applying majority voting, or resampling until validation passes (Stroebl et al., 2024; Chen et al., 2024a). However, these seemingly simple solutions often prove inconsistent, echoing our case-study results. This underscores the need for more robust, structured strategies, discussed in the following sections.
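A sketch of this tactic under stated assumptions (`generate` and `validate` are hypothetical callables supplied by the surrounding system):

```python
from collections import Counter
from typing import Callable, Optional

def sample_until_valid(generate: Callable[[], str],
                       validate: Callable[[str], bool],
                       k: int = 5) -> Optional[str]:
    """Draw k candidate answers; return the first that passes validation,
    otherwise fall back to a majority vote over all samples."""
    samples = [generate() for _ in range(k)]
    for candidate in samples:
        if validate(candidate):
            return candidate          # resampling until validation passes
    majority, _count = Counter(samples).most_common(1)[0]
    return majority                   # majority vote as a fallback
```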
5.2. Structured Strategies
In addition to the tactical approaches discussed above, deeper solutions are needed to shape the structure of the MAS at hand. We first observe the critical role of validation processes and validation agents in multi-agent systems. Our annotations indicate that weak or insufficient validation mechanisms are significant contributors to system failures. While unit test generation aids validation in software engineering (Jain et al., 2024), creating universal validation mechanisms remains challenging. Even in coding, covering all edge cases is complex, even for experts. Validation varies by domain: coding requires thorough test coverage, quality assurance necessitates certified data checks (Peng et al., 2023), and reasoning benefits from symbolic validation (Kapanipathi et al., 2020). Adjusting validation across different domains remains an ongoing research challenge.
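For the coding domain specifically, validation can be made concrete by executing generated code against unit tests in an isolated process. A hedged sketch (it assumes `pytest` is installed and that the tests import the solution from `solution.py`):

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def validate_with_tests(generated_code: str, test_code: str, timeout: int = 60) -> bool:
    """Last-line-of-defense verifier for the coding domain: write the generated
    module and its tests to a temporary directory and run pytest on them.
    Passing is necessary but not sufficient; edge-case coverage remains hard."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "solution.py").write_text(generated_code)
        Path(workdir, "test_solution.py").write_text(test_code)  # tests import solution
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", workdir],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0
```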
Complementing validation strategies is the establishment of standardized communication protocols (Li et al., 2024b). LLM-based agents primarily communicate through unstructured text, which invites ambiguity. Clearly defining intents and parameters can improve alignment and enable formal coherence checks during and after interactions. Niu et al. (2021) introduce a multi-agent graph-attention mechanism that models agent interactions to enhance coordination. Similarly, Jiang & Lu (2018) propose an attentional communication mechanism that allows agents to focus selectively on relevant information, and Singh et al. (2018) develop a learned selective communication protocol to improve cooperation efficiency.
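One way to realize such a protocol is to replace free-form text with typed messages whose intent and parameters are validated before dispatch. The schema below is illustrative, not a standard:

```python
from dataclasses import dataclass
from enum import Enum

class Intent(Enum):
    REQUEST = "request"
    INFORM = "inform"
    VERIFY = "verify"

@dataclass(frozen=True)
class AgentMessage:
    sender: str
    receiver: str
    intent: Intent
    parameters: dict          # e.g., {"api": "get_contacts", "username": "..."}

def check_coherence(msg: AgentMessage, required_params: set[str]) -> None:
    """Formal coherence check: reject under-specified messages at dispatch time
    instead of letting ambiguity propagate through the dialogue."""
    missing = required_params - msg.parameters.keys()
    if missing:
        raise ValueError(f"{msg.sender} -> {msg.receiver}: missing parameters {missing}")
```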
Another important research direction is fine-tuning MAS agents using reinforcement learning. Agents can be trained using algorithms specific to their roles, rewarding task-aligned behaviors and penalizing inefficiencies. MAPPO (Yu et al., 2022) optimizes agents’ adherence to established roles. Similarly, SHPPO (Guo et al., 2024b) uses latent networks to learn strategies, which are then applied to heterogeneous decision layers. Optima (Chen et al., 2024b) further enhances communication efficiency and task effectiveness through iterative reinforcement learning.
On the other hand, incorporating probabilistic confidence into agent interactions can significantly enhance the reliability of decisions and communication. Inspired by the framework proposed by Horvitz et al., agents can express uncertainty about their predictions and actions, allowing other agents to weigh risks and make more informed decisions. For instance, agents can attach confidence intervals or probability distributions to quantify the reliability of their outputs, helping other agents identify potential errors and take corrective action.
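A toy sketch of confidence-aware interaction (the threshold and escalation rule are our assumptions): each agent attaches a self-reported probability to its answer, and the consuming agent escalates low-confidence outputs rather than acting on them, which could have flagged the credentials error from Figure 5:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    answer: str
    confidence: float   # agent's self-reported probability of being correct

def act_on(pred: Prediction, threshold: float = 0.8) -> str:
    """Downstream agents weigh risk: act on confident answers, escalate the rest."""
    if pred.confidence >= threshold:
        return f"ACCEPT: {pred.answer}"
    return "ESCALATE: request clarification or a second opinion"

print(act_on(Prediction("username is the phone number", 0.95)))  # ACCEPT
print(act_on(Prediction("username is the email id", 0.40)))      # ESCALATE
```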
In summary, building reliable multi-agent LLM systems requires a multifaceted strategy, including tactical and structured approaches. By addressing key challenges such as validation, communication, and uncertainty, we can construct more robust and intelligent multi-agent systems capable of solving complex problems and achieving goals beyond human reach.
6.2. Case Study 2: ChatDev
ChatDev (Qian et al., 2023) simulates a multi-agent software company in which agents with distinct role specifications, such as CEO, CTO, software engineers, and reviewers, collaborate on software-generation tasks. To address the challenges we frequently observed in its trajectories, we implemented two interventions. The first refines role-specific prompts to strengthen hierarchy and role adherence. For example, we observed the CPO prematurely ending discussions with the CEO before constraints were fully resolved; to prevent this, we ensured that only the superior agent can finalize a dialogue. We also enhanced the verifier's role specification to focus on task-specific edge cases. Details of these interventions appear in Appendix F. The second solution attempts a more fundamental change: we modified the framework to enforce a stricter hierarchy, in which each agent must report to its superior and obtain approval before proceeding to the next step. This modification aims to reduce communication ambiguity and ensure that all decisions align with the overall objectives.
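The second intervention can be sketched as a small guard in the orchestration loop (the role names and `RANK` table are illustrative, not ChatDev's actual implementation):

```python
# Illustrative rank table: lower number = higher authority.
RANK = {"CEO": 0, "CTO": 1, "CPO": 1, "Programmer": 2, "Reviewer": 2}

def can_finalize(speaker: str, counterpart: str) -> bool:
    """Only the superior agent may end a discussion, preventing, e.g., the CPO
    from terminating a CEO dialogue before all constraints are resolved."""
    return RANK[speaker] < RANK[counterpart]

def proceed_to_next_step(agent: str, step: str, superior_approved: bool) -> str:
    """Stricter hierarchy: every step requires the superior's explicit approval."""
    if not superior_approved:
        raise PermissionError(f"{agent} needs approval before proceeding to: {step}")
    return f"{agent} proceeds to {step}"
```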
To evaluate the effectiveness of these interventions, we conducted experiments using the ChatDev framework on two datasets: one for generating simple Python functions and another for building more complex web applications. We used the same evaluation metrics to measure performance, including code correctness, readability, and efficiency.
Experimental results indicate that refining prompts and modifying the framework positively impacted ChatDev’s performance. However, these improvements were not always significant and varied across different datasets. For instance, in the simple Python function generation task, refining prompts significantly improved code correctness but had a smaller impact on readability and efficiency. In the more complex web application building task, modifying the framework significantly improved code efficiency but had a smaller impact on correctness and readability.
These results suggest that building reliable multi-agent systems requires careful consideration of various factors, including task complexity, interactions among agents, and the selection of evaluation metrics. Furthermore, different interventions may have varying impacts on different tasks, necessitating careful experimentation and analysis to determine the best strategies.
7. Discussion
This study reveals the challenges faced by multi-agent LLM systems and presents a comprehensive failure taxonomy (MASFT). Through systematic evaluation of five popular open-source MAS, we identified 14 unique failure modes and organized them into three main categories: specification and system design failures, misalignment among agents, and task validation and termination.
Our findings emphasize the inherent challenges in MAS design and indicate that merely enhancing the capabilities of underlying models may not suffice to address all issues. Many failures stem from challenges in inter-agent interactions rather than limitations of individual agents. This aligns with findings from research on high-reliability organizations, which indicate that clearly defined principles can prevent catastrophic failures.
MASFT can serve as a framework guiding future research, outlining potential solutions for the 14 failure modes. For instance, to address specification and system design failures, future work could focus on developing clearer and more explicit task specifications, as well as more robust system architectures. To address misalignment among agents, future work could explore improved communication protocols and collaboration mechanisms. To address task validation and termination issues, future work could focus on developing more effective validation procedures and termination criteria.
Our findings also indicate the need for developing more effective evaluation metrics and diagnostic tools to identify and address failures in MAS. The LLM-based process we proposed as an evaluator provides a promising avenue for achieving this goal. By automating failure detection and classification, we can accelerate the development and improvement of MAS.
8. Limitations and Future Directions
While our research provides valuable insights into understanding MAS failures, it also has some limitations. First, our study is limited to five open-source MAS. Future work could expand to a broader range of MAS to enhance the generalizability of our findings. Second, our research primarily focuses on the accuracy of task completion. Future work could explore other important metrics, such as efficiency, interpretability, and robustness. Third, our research mainly focuses on single-turn tasks. Future work could explore multi-turn tasks and continuous interaction in MAS.
Moreover, we acknowledge that MASFT is not an exhaustive taxonomy. As the field of MAS continues to evolve, new failure modes may emerge. Therefore, we need to continuously refine and update MASFT to reflect the latest research findings.
Future work could also explore integrating MASFT with other relevant frameworks and theories. For example, we could combine MASFT with human-computer interaction (HCI) theories to better understand user experiences in MAS. We could also integrate MASFT with software engineering theories to better understand the design and implementation of MAS.
Finally, we believe that open-sourcing our dataset and LLM annotations is crucial for advancing research in the MAS field. We hope our work inspires others to explore the challenges and opportunities in MAS and contribute to building more reliable and intelligent multi-agent systems.
9. Conclusion
In this study, we present the first systematic study of failure modes in multi-agent systems based on large language models (LLMs). We collected and analyzed over 150 trajectories, iteratively refined our taxonomy under the guidance of grounded theory, and validated it through cross-annotator studies. We identified 14 granular failure modes and categorized them into three distinct failure categories: specification and system design failures, misalignment among agents, and task validation and termination.
Table 3 summarizes the correspondence between solution strategies and failure categories in multi-agent systems. We pursued two families of strategies. Tactical approaches include clear role and task definitions, additional discussion rounds, self-verification, and conversation-pattern design, primarily targeting specification and system design failures and misalignment among agents. Structured approaches focus on comprehensive verification, standardized communication protocols, and probabilistic confidence measures, aimed at improving the overall reliability and robustness of the system.
To validate the effectiveness of these strategies, we conducted case studies under the AG2 and ChatDev frameworks. Table 4 presents the experimental results, including accuracy comparisons on benchmarks such as GSM-Plus (with GPT-4 and GPT-4o), ProgramDev, and HumanEval. We compared three configurations: baseline implementations, improved prompts, and redesigned agent topologies. The results show that while our interventions improved the frameworks' performance across tasks, the gains were not always significant and varied across benchmarks.
In summary, this study provides valuable insights into understanding the failure modes of multi-agent LLM systems and lays the groundwork for building more reliable and intelligent multi-agent systems. Future research can further explore integrating MASFT with other relevant frameworks and theories and developing more effective evaluation metrics and diagnostic tools. We hope our work inspires others to explore the challenges and opportunities in multi-agent systems and contribute to advancements in the field of artificial intelligence.