Chapter 6: Multi-Agent Systems

1. Why Do We Need Multiple Agents? Division of Labor, Fault Tolerance, and Coverage of Specialized Fields

A single agent can complete a well-defined task, but it struggles with complex projects, work that needs several collaborating roles, and requirements for high fault tolerance.

The core value of a Multi-Agent System (MAS) is collaboration among specialized agents. Much like teams in a company, they divide the work, run in parallel, verify each other's output, and adjust dynamically.

🧠 Academic Background Supplement

The concept of multi-agent systems is not new; its theoretical roots can be traced back to:

Distributed Artificial Intelligence (DAI): In the 1980s, research focused on how multiple AI entities collaborate to solve problems.
Game Theory: Nash equilibrium and incentive (mechanism) design, used to model competition and cooperation among agents.
Organizational Modeling: Mimicking the hierarchical structure and role division of enterprises/military.
Swarm Intelligence: Inspired by ant colonies and bird flocks, self-organized collaboration without central control.

🎯 Limitations of a Single Agent

Issue	Example	Multi-Agent Solution
Single Capability	One agent must both write reports and create graphics, so quality suffers.	Split into “Researcher + Writer + Designer” agents.
No Fault Tolerance	If the report-writing agent makes an error, the entire process fails.	Introduce a “Reviewer Agent” to verify results.
Inefficiency	Sequential execution: data retrieval → report writing → PPT creation, with every step waiting for the previous one.	Parallel execution: independent subtasks (e.g., retrieving data from several sources) run simultaneously.
Lack of Specialized Depth	A generalist agent struggles to master law, finance, and design.	Each agent focuses on one domain.

🔄 Core Advantages of Multi-Agent Systems

✅ Specialized Division of Labor: Each agent only does what it excels at.
✅ Parallel Efficiency: Multiple tasks executed simultaneously, reducing total time.
✅ Enhanced Fault Tolerance: If one agent fails, others can take over or retry.
✅ Scalability: New functionalities can be added by introducing new agents without affecting the existing system.
✅ Competitive Optimization: Multiple agents provide different solutions, allowing for optimal selection (e.g., bidding, voting).

💡 Summary in One Sentence: A single agent is like a “full-stack engineer,” while a multi-agent system resembles a “product company,” with a project manager, developers, testers, and designers each playing their own role.

2. Multi-Agent Architecture Patterns

A multi-agent system is not just “a bunch of agents running around”; it has a clear collaborative architecture. There are five mainstream patterns:

🖼️ [Suggested Image Location] Overview of Multi-Agent Collaboration Patterns

👥 [Role-based] Collaborative Dialogue
🏭 [Pipeline] Sequential Transfer
💰 [Auction-based] Competitive Selection
🏢 [Hierarchical] Top-down Command
🐜 [Swarm Intelligence] Local Interaction

🧑‍💻 1. Role-based Collaboration

Each agent has a fixed role and responsibility.
Agents collaborate through dialogue or a shared blackboard (a common data store that every role can read and write).
Applicable Scenarios: Content creation, code review, customer service teams.

📖 Example: Academic Paper Writing Team

- Researcher Agent: Literature review, data organization
- Writer Agent: Drafting initial version
- Reviewer Agent: Checking logic, grammar, format
- Editor Agent: Polishing language, adjusting structure
- Publisher Agent: Generating PDF, submitting to system

✅ Advantages: Clear responsibilities; failures are easy to attribute to a specific role.
❌ Challenges: Handover protocols between roles must be designed explicitly.
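To make the handover challenge concrete, here is a minimal sketch of one possible protocol, assuming each role declares the blackboard keys it requires and the key it produces; a small scheduler runs any role whose inputs are ready and reports exactly which inputs are missing if the handover stalls. The `Role` class, `run_roles` scheduler, and lambda "agents" are hypothetical stand-ins for LLM-backed agents.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Role:
    name: str
    requires: tuple[str, ...]      # blackboard keys this role reads
    produces: str                  # blackboard key this role writes
    act: Callable[[dict], str]     # stand-in for an LLM call


def run_roles(roles: list[Role], blackboard: dict) -> list[str]:
    """Run roles until all have produced output; return the execution order."""
    order = []
    pending = list(roles)
    while pending:
        ready = [r for r in pending if all(k in blackboard for k in r.requires)]
        if not ready:
            missing = {r.name: [k for k in r.requires if k not in blackboard] for r in pending}
            raise RuntimeError(f"Handover stalled, missing inputs: {missing}")
        for role in ready:
            blackboard[role.produces] = role.act(blackboard)
            order.append(role.name)
            pending.remove(role)
    return order


# Roles listed out of order on purpose: the declared inputs decide the sequence.
team = [
    Role("publisher", ("edited",), "pdf", lambda bb: f"PDF({bb['edited']})"),
    Role("editor", ("reviewed",), "edited", lambda bb: bb["reviewed"] + " [polished]"),
    Role("reviewer", ("draft",), "reviewed", lambda bb: bb["draft"] + " [checked]"),
    Role("writer", ("notes",), "draft", lambda bb: "Draft based on " + bb["notes"]),
    Role("researcher", ("topic",), "notes", lambda bb: f"notes on {bb['topic']}"),
]
board = {"topic": "multi-agent systems"}
print(run_roles(team, board))  # → ['researcher', 'writer', 'reviewer', 'editor', 'publisher']
print(board["pdf"])
```

Declaring inputs and outputs up front turns a vague "handover" into something the system can check: a role never starts with missing data, and a broken chain fails loudly instead of silently.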

🏭 2. Pipeline

Agents execute in a fixed order, with the output of one serving as the input for the next.
Similar to a factory assembly line.
Applicable Scenarios: Data processing, report generation, approval processes.

📖 Example: Market Report Generation Pipeline

[Market Research Agent] → [Data Analysis Agent] → [Report Writing Agent] → [PPT Generation Agent] → [Email Sending Agent]

✅ Advantages: Simple structure, easy to implement.
❌ Challenges: A slow or failed upstream agent blocks every downstream stage, delaying the whole chain.
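A minimal sketch of a pipeline runner, assuming each agent is an async function whose output feeds the next one. Giving each stage its own timeout (an assumed policy, not the only option) addresses the blocking risk: a stuck agent fails fast with a clear error rather than stalling the chain indefinitely. The stage functions here are hypothetical stubs standing in for LLM-backed agents.

```python
import asyncio
from typing import Awaitable, Callable

Stage = tuple[str, Callable[[str], Awaitable[str]]]


class StageError(RuntimeError):
    pass


async def run_pipeline(stages: list[Stage], payload: str, timeout_s: float = 5.0) -> str:
    # Each stage's output becomes the next stage's input.
    for name, stage in stages:
        try:
            payload = await asyncio.wait_for(stage(payload), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            raise StageError(f"stage '{name}' timed out after {timeout_s}s") from exc
    return payload


# Hypothetical stub agents.
async def research(industry: str) -> str:
    return f"data[{industry}]"


async def analyze(data: str) -> str:
    return f"analysis[{data}]"


async def write_report(analysis: str) -> str:
    return f"report[{analysis}]"


async def stuck(_: str) -> str:
    await asyncio.sleep(10)  # simulates an agent that never answers in time
    return "never"


stages = [("research", research), ("analysis", analyze), ("report", write_report)]
print(asyncio.run(run_pipeline(stages, "EV")))  # → report[analysis[data[EV]]]

# A stuck stage fails fast instead of blocking forever:
# asyncio.run(run_pipeline([("research", research), ("stuck", stuck)], "EV", timeout_s=0.1))
# raises StageError("stage 'stuck' timed out after 0.1s")
```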

💰 3. Auction-based / Competitive

Multiple agents compete for the same task, and the lowest bid (or the highest-quality output) wins.
Theoretical Basis: Game theory + auction mechanism design.
Key Mechanisms

Incentive Compatibility: Reward mechanisms encourage agents to quote honestly.
Strategic Equilibrium: Avoid malicious underbidding or inflated quotes.

Applicable Scenarios: Resource allocation, solution selection, creative generation.
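One standard way to get incentive compatibility is a second-price sealed-bid auction. Below is a minimal sketch of its reverse (procurement) form: the lowest bidder wins the task but is paid the second-lowest bid. Because the payment does not depend on the winner's own bid, under the textbook assumptions (a single task, independent private costs) bidding one's true cost is a dominant strategy, which also removes the payoff from malicious underbidding. The agent names and costs are hypothetical.

```python
def reverse_vickrey(bids: dict[str, float]) -> tuple[str, float]:
    """Return (winner, payment) for a single task; needs at least two bids."""
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    ranked = sorted(bids.items(), key=lambda kv: kv[1])  # ascending cost
    winner, _ = ranked[0]
    payment = ranked[1][1]  # the second-lowest bid sets the price
    return winner, payment


bids = {"agent_a": 12.0, "agent_b": 9.5, "agent_c": 15.0}
print(reverse_vickrey(bids))  # → ('agent_b', 12.0)
```

If `agent_b` had shaded its bid down to 5.0, it would still win and still be paid 12.0, so lying gains nothing; bidding below its true cost only risks winning at a loss.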

📖 Example: Generating Advertising Copy

User: Generate 3 promotional copies, select the best one.

→ 3 Copywriter Agents generate copies simultaneously.
→ Reviewer Agent scores (creativity, compliance, conversion).
→ Select the highest-scoring copy, reward the corresponding agent (Token/points).
→ Other copies stored in the knowledge base for future use.

✅ Advantages: Stimulates diversity, avoids rigid thinking.
❌ Challenges: Requires fair scoring and incentive mechanisms.
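The advertising-copy flow above can be sketched as follows, assuming three copywriter agents run concurrently via `asyncio.gather`, and a reviewer scores each copy on creativity, compliance, and conversion. The best weighted score wins; the rest are archived. The stub agents, the fixed reviewer scores, the weights, and the one-point reward are all hypothetical.

```python
import asyncio

WEIGHTS = {"creativity": 0.4, "compliance": 0.3, "conversion": 0.3}  # assumed weights


async def copywriter(agent_id: str, product: str) -> dict:
    await asyncio.sleep(0.01)  # stands in for an LLM call
    return {"agent": agent_id, "text": f"{product}: pitch by {agent_id}"}


def reviewer_scores(copy: dict) -> dict:
    # Hypothetical fixed scores; a real reviewer would be an LLM or a human rubric.
    table = {
        "writer_1": {"creativity": 7, "compliance": 9, "conversion": 6},
        "writer_2": {"creativity": 9, "compliance": 8, "conversion": 8},
        "writer_3": {"creativity": 6, "compliance": 10, "conversion": 7},
    }
    return table[copy["agent"]]


def weighted(scores: dict) -> float:
    return sum(WEIGHTS[k] * v for k, v in scores.items())


async def run_contest(product: str):
    ids = ["writer_1", "writer_2", "writer_3"]
    # All three copywriters work at the same time.
    copies = await asyncio.gather(*(copywriter(i, product) for i in ids))
    ranked = sorted(copies, key=lambda c: weighted(reviewer_scores(c)), reverse=True)
    winner, archive = ranked[0], ranked[1:]  # archive goes to the knowledge base
    rewards = {winner["agent"]: 1}           # e.g., one point for the winner
    return winner, archive, rewards


winner, archive, rewards = asyncio.run(run_contest("SolarBottle"))
print(winner["agent"], rewards, [c["agent"] for c in archive])
# → writer_2 {'writer_2': 1} ['writer_3', 'writer_1']
```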

🏢 4. Hierarchical

Mimics corporate organization: Manager Agent → Sub-agent.
Manager is responsible for task breakdown, resource allocation, and result acceptance.
Sub-agents are responsible for specific execution.
Applicable Scenarios: Large project management, military command, enterprise ERP.

📖 Example: Software Development Team

[CTO Agent] → Break down requirements → Assign to [PM Agent]
[PM Agent] → Develop plan → Assign to [Dev Agent], [Test Agent]
[Dev Agent] → Coding → Deliver to [Test Agent]
[Test Agent] → Testing → Feedback to [PM Agent]
[PM Agent] → Acceptance → Report to [CTO Agent]

✅ Advantages: Suitable for extremely complex tasks, clear responsibility chain.
❌ Challenges: Manager Agent requires strong planning capabilities.

🐜 5. Swarm Intelligence

No central control; agents achieve global goals through local interactions.
Inspired by ant colonies, bird flocks, and fish schools.
Core Mechanisms: Pheromones, local rules, positive feedback.
Applicable Scenarios: Robot swarms, traffic scheduling, high-frequency trading.

📖 Example: Logistics Path Optimization

100 logistics agents simultaneously explore delivery paths.
→ Each agent records "path time" and broadcasts.
→ Other agents prioritize "low time paths."
→ The swarm tends to converge on a short, often near-optimal path (reaching the global optimum is not guaranteed).

✅ Advantages: High fault tolerance, adaptive, scalable.
❌ Challenges: Slow convergence, requires a large number of agents.
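A minimal ant-colony-style sketch of the logistics example. To keep it short, it assumes three fixed candidate routes with known travel times (hypothetical numbers) rather than paths built edge by edge on a road graph. Each round, every agent picks a route with probability proportional to that route's pheromone. It then deposits pheromone inversely proportional to travel time, while evaporation fades old signals. This positive feedback lets faster routes accumulate more pheromone without any central controller.

```python
import random


def swarm_route(times: dict[str, float], n_agents: int = 100, rounds: int = 30,
                evaporation: float = 0.3, seed: int = 0) -> dict[str, float]:
    rng = random.Random(seed)
    routes = list(times)
    pheromone = {r: 1.0 for r in routes}  # equal initial signal on every route
    for _ in range(rounds):
        # Local rule: each agent follows the pheromone it observes.
        weights = [pheromone[r] for r in routes]
        choices = rng.choices(routes, weights=weights, k=n_agents)
        deposits = {r: 0.0 for r in routes}
        for route in choices:
            deposits[route] += 1.0 / times[route]  # faster route, stronger signal
        for r in routes:
            pheromone[r] = (1 - evaporation) * pheromone[r] + deposits[r]
    return pheromone


pheromone = swarm_route({"A": 30.0, "B": 20.0, "C": 45.0})
best = max(pheromone, key=pheromone.get)
print(best)  # → B (the fastest route)
```

The evaporation rate trades stability for adaptivity: a higher rate forgets stale routes faster (useful when traffic changes) but makes the swarm noisier.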

🆚 Comparison Table of Five Patterns

Pattern	Collaboration Method	Advantages	Disadvantages	Applicable Scenarios
Role-based	Dialogue/Blackboard Sharing	Strong specialization, high fault tolerance	Complex handover protocols, long dialogues hard to trace	Content creation, code review
Pipeline	Sequential Transfer	Simple structure, easy to implement	Blocking risk, low flexibility	Report generation, approval flow
Auction-based	Competition + Scoring	High diversity, encourages innovation	Complex mechanisms, high costs	Creative generation, resource allocation
Hierarchical	Top-down Command	Suitable for extremely complex projects	Heavy burden on Manager	Enterprise management, military command
Swarm Intelligence	Local Interaction	High fault tolerance, adaptive	Slow convergence, difficult to control	Robotics, traffic scheduling

🧭 Selection Recommendations:

Beginners → Pipeline

Professional Scenarios → Role-based / Hierarchical

Creative/Selection Scenarios → Auction-based

Physical/Financial Scenarios → Swarm Intelligence

3. Communication Protocols Between Agents: Message Format, Dialogue Management, State Synchronization

The core of multi-agent collaboration is communication — how do they “talk”? How do they “synchronize states”? How do they “avoid conflicts”?

🖼️ [Suggested Image Location] Multi-Agent Communication Architecture Diagram

[Agent A] ↔ [Message Broker] ↔ [Agent B]
           ↕
     [Shared Blackboard]
           ↕
      [Event Bus]

📨 1. Message Format

Communication between agents must be structured; a standard format is recommended:

{
  "from":"researcher_agent",
  "to":"writer_agent",
  "content":{
      "task_id":"TASK20250601",
      "data":{ ... },
      "metadata":{
        "priority":"high",
        "deadline":"2025-06-02T10:00:00Z"
      }
  },
  "timestamp":"2025-06-01T09:30:00Z"
}

✅ Key Fields: from/to, content, metadata, timestamp.
❌ Avoid: pure natural-language communication (prone to ambiguity, hard to parse).
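A minimal sketch of the message format above as a Python dataclass, with validation on construction and a JSON round trip. The field is named `sender` in Python because `from` is a keyword, but it is serialized as `"from"`. The allowed priority values and the ISO-8601 deadline check are assumptions, not part of any standard.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

PRIORITIES = {"low", "normal", "high"}  # assumed vocabulary


@dataclass
class AgentMessage:
    sender: str                 # serialized as "from"
    to: str
    task_id: str
    data: dict
    priority: str = "normal"
    deadline: Optional[str] = None
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def __post_init__(self):
        # Reject malformed messages at the boundary instead of deep inside an agent.
        if not self.sender or not self.to:
            raise ValueError("sender and recipient are required")
        if self.priority not in PRIORITIES:
            raise ValueError(f"unknown priority: {self.priority}")
        if self.deadline is not None:
            datetime.fromisoformat(self.deadline.replace("Z", "+00:00"))  # raises if malformed

    def to_json(self) -> str:
        return json.dumps({
            "from": self.sender,
            "to": self.to,
            "content": {
                "task_id": self.task_id,
                "data": self.data,
                "metadata": {"priority": self.priority, "deadline": self.deadline},
            },
            "timestamp": self.timestamp,
        })

    @classmethod
    def from_json(cls, raw: str) -> "AgentMessage":
        obj = json.loads(raw)
        content = obj["content"]
        meta = content["metadata"]
        return cls(sender=obj["from"], to=obj["to"], task_id=content["task_id"],
                   data=content["data"], priority=meta["priority"],
                   deadline=meta["deadline"], timestamp=obj["timestamp"])


msg = AgentMessage("researcher_agent", "writer_agent", "TASK20250601",
                   {"summary": "placeholder payload"}, priority="high",
                   deadline="2025-06-02T10:00:00Z")
assert AgentMessage.from_json(msg.to_json()) == msg
```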

💬 2. Dialogue Management

Dialogue ID: Each task has a unique ID for tracking.
Dialogue Status: pending / running / completed / failed.
Timeout Mechanism: If an agent does not respond → automatically retry or escalate.

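import time

class ConversationManager:
    def __init__(self, timeout_s=30.0, max_retries=2):
        self.conversations = {}  # task_id → {status, messages, deadline, retries}
        self.timeout_s, self.max_retries = timeout_s, max_retries

    def send_message(self, task_id, from_agent, to_agent, content):
        # Record the message, mark the task running, and restart its timeout clock
        conv = self.conversations.setdefault(task_id, {"status": "pending", "messages": [], "retries": 0})
        conv["messages"].append({"from": from_agent, "to": to_agent, "content": content})
        conv["status"], conv["deadline"] = "running", time.monotonic() + self.timeout_s

    def check_timeout(self, task_id):
        # Past the deadline: "retry" (re-sending resets the clock) until retries run out, then "escalate"
        conv = self.conversations[task_id]
        if conv["status"] == "running" and time.monotonic() >= conv["deadline"]:
            conv["retries"] += 1
            conv["status"] = "failed" if conv["retries"] > self.max_retries else "timed_out"
        return {"failed": "escalate", "timed_out": "retry"}.get(conv["status"], "ok")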

🔄 3. State Synchronization

Shared Blackboard: A central storage that all agents can read and write.
Event-driven: Agents publish events, others subscribe.
Database Synchronization: Use Redis / PostgreSQL to store shared states.
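A minimal in-process sketch of the event-driven option: agents subscribe handlers to topic names, and the bus delivers each published event to every subscriber. The `EventBus` class and topic names are hypothetical. In production, the same interface would typically sit on top of a broker such as Redis pub/sub or Kafka.

```python
from collections import defaultdict
from typing import Any, Callable

Handler = Callable[[dict[str, Any]], None]


class EventBus:
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict[str, Any]) -> int:
        """Deliver the event to every subscriber; return how many received it."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = EventBus()
received = []
bus.subscribe("research.done", lambda e: received.append(("writer", e["task_id"])))
bus.subscribe("research.done", lambda e: received.append(("monitor", e["task_id"])))
print(bus.publish("research.done", {"task_id": "TASK20250601"}))  # → 2
print(received)  # → [('writer', 'TASK20250601'), ('monitor', 'TASK20250601')]
```

The publisher does not need to know who is listening, so a new agent (for example, a monitor) can be added without touching the existing ones.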

🛡️ Security and Robustness Enhancement:

Risk	Defense Strategy
Malicious Agent Injection	Role whitelist + digital signature + behavior auditing.
Data Pollution	Input validation + output verification + version rollback.
Role Conflict	Priority arbitration + locking mechanism + timeout preemption.
Communication Hijacking	TLS encryption + message authentication + access control.
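For the "message authentication" defense, here is a minimal sketch using Python's standard `hmac` module. The sender attaches an HMAC-SHA256 tag computed over the canonical JSON body. The receiver recomputes the tag and compares the two in constant time. This assumes all agents share one secret key (the key below is a placeholder). HMAC proves integrity and origin among key holders, but it is not an asymmetric digital signature; per-agent signatures would need something like Ed25519.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-key-from-your-secret-store"  # placeholder


def _canonical(body: dict) -> bytes:
    # Sorted keys and fixed separators give the same bytes on both sides.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def sign(message: dict, key: bytes = SECRET_KEY) -> dict:
    tag = hmac.new(key, _canonical(message), hashlib.sha256).hexdigest()
    return {"body": message, "hmac": tag}


def verify(envelope: dict, key: bytes = SECRET_KEY) -> bool:
    expected = hmac.new(key, _canonical(envelope["body"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope.get("hmac", ""))


env = sign({"from": "researcher_agent", "to": "writer_agent", "task_id": "TASK20250601"})
print(verify(env))  # → True
env["body"]["to"] = "attacker_agent"  # tampering in transit
print(verify(env))  # → False
```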

⚠️ Common Communication Issues and Responses

Issue	Manifestation	Solution
Message Loss	Agent does not receive instructions.	ACK confirmation + retransmission mechanism.
Deadlock	A waits for B, B waits for A.	Set timeout + priority arbitration.
State Inconsistency	Two agents use different data.	Central state management + versioning.
High Communication Costs	Frequent message passing consumes tokens.	Batch sending + local caching.
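To illustrate "central state management + versioning," here is a hedged sketch of optimistic concurrency. Every key carries a version number, and a write succeeds only if the writer saw the latest version (compare-and-set). A stale writer gets a conflict and must re-read rather than silently overwrite newer data. `VersionedStore` is hypothetical. The same idea maps onto, for example, a version column checked in the `WHERE` clause of a SQL `UPDATE`, or Redis `WATCH`/`MULTI` transactions.

```python
import threading


class VersionedStore:
    def __init__(self) -> None:
        self._data: dict[str, tuple[int, object]] = {}  # key → (version, value)
        self._lock = threading.Lock()

    def read(self, key: str) -> tuple[int, object]:
        with self._lock:
            return self._data.get(key, (0, None))

    def compare_and_set(self, key: str, expected_version: int, value: object) -> bool:
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # someone else wrote first: caller must re-read
            self._data[key] = (current_version + 1, value)
            return True


store = VersionedStore()
v, _ = store.read("report_section")                               # both agents read version 0
print(store.compare_and_set("report_section", v, "draft from A"))  # agent A → True
print(store.compare_and_set("report_section", v, "draft from B"))  # agent B (stale) → False
print(store.read("report_section"))                                # → (1, 'draft from A')
```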

4. Practical Example: Building a Multi-Agent Pipeline for “Market Research → Report Writing → PPT Generation”

Next, we will implement a simplified multi-agent pipeline using LangChain and an LLM.

🎯 Goal: User inputs industry → Automatically generate market analysis report + PPT.

🌐 Extended Cases: Cross-Domain Multi-Agent Applications

🧪 Case 1: Research Collaboration Agents (Papers + Experiments + Graphics)

[Literature Review Agent] → [Experiment Design Agent] → [Data Simulation Agent] → [Graphics Agent] → [Paper Writing Agent]

📈 Case 2: Financial Trading Agents (Analysis + Decision + Execution)

[Macro Analysis Agent] → [Industry Research Agent] → [Quantitative Strategy Agent] → [Trading Execution Agent] → [Risk Control Agent]

🤖 Case 3: Robot Swarm Agents (Exploration + Collaboration + Obstacle Avoidance)

[Leader Robot] → Assign Areas → [Scout Robot] Explores → [Mapper Robot] Maps → [Carrier Robot] Transports

🧩 System Architecture (Market Report Case)

[User] 
   ↓
[Coordinator Agent] → Assign tasks, monitor progress
   ↓
[Market Research Agent] → Call search engines/RAG, output data
   ↓
[Report Writing Agent] → Generate Markdown report based on data
   ↓
[PPT Generation Agent] → Convert report to PPT (using python-pptx)
   ↓
[Delivery Agent] → Save file + notify user

🛠️ Code Implementation (Simplified Version, Core Logic)

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import asyncio

llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)

# 1. Market Research Agent
async def market_research_agent(industry: str) -> str:
    prompt = f"Please research the {industry} industry, output: market size, growth rate, Top 3 companies, trend analysis."
    response = await llm.ainvoke([HumanMessage(content=prompt)])
    return response.content

# 2. Report Writing Agent
async def report_writer_agent(data: str) -> str:
    prompt = f"Write a report based on the following data:\n{data}\nRequirements: clear structure, including title, abstract, body, conclusion."
    response = await llm.ainvoke([HumanMessage(content=prompt)])
    return response.content

# 3. PPT Generation Agent (Simplified: Generate PPT Outline)
async def ppt_agent(report: str) -> str:
    prompt = f"Convert the following report into a PPT outline, one title + 3 key points per slide:\n{report}"
    response = await llm.ainvoke([HumanMessage(content=prompt)])
    return response.content

# 4. Main Process (Pipeline)
async def multi_agent_pipeline(industry: str):
    print("🔍 Step 1: Market Research...")
    data = await market_research_agent(industry)
    
    print("📝 Step 2: Writing Report...")
    report = await report_writer_agent(data)
    
    print("📊 Step 3: Generating PPT...")
    ppt = await ppt_agent(report)
    
    print("✅ Done!")
    return {"data": data, "report": report, "ppt": ppt}

# Run
result = asyncio.run(multi_agent_pipeline("New Energy Vehicles"))
print(result["ppt"][:500])  # Print the first 500 characters of the PPT outline

🖨️ Output Example (illustrative; actual content and figures vary from run to run):

PPT Outline:
Slide 1: Analysis of the New Energy Vehicle Industry
- Market Size: ¥50 billion
- Growth Rate: 12%
- Trends: Intelligence, Globalization

Slide 2: Top 3 Competitors
- Company A: Market Share 30%
- Company B: Market Share 25%
- Company C: Market Share 20%
...
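The simplified `ppt_agent` stops at a text outline. Here is a hedged sketch of the missing step: parse an outline in the "Slide N: title" / "- point" shape shown above into slide data, then render it with python-pptx (`pip install python-pptx`). The parser assumes the LLM follows exactly that shape, which real output may not always do. `parse_outline` and `render_pptx` are hypothetical helper names.

```python
import re


def parse_outline(text: str) -> list[dict]:
    slides = []
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"Slide\s+\d+:\s*(.+)", line)
        if header:
            slides.append({"title": header.group(1), "points": []})
        elif line.startswith("- ") and slides:
            slides[-1]["points"].append(line[2:])
        # Anything else ("PPT Outline:", "...") is ignored.
    return slides


def render_pptx(slides: list[dict], path: str) -> None:
    from pptx import Presentation  # imported lazily: only needed for rendering

    prs = Presentation()
    layout = prs.slide_layouts[1]  # "Title and Content" in the default template
    for spec in slides:
        slide = prs.slides.add_slide(layout)
        slide.shapes.title.text = spec["title"]
        body = slide.placeholders[1].text_frame
        body.text = spec["points"][0] if spec["points"] else ""
        for point in spec["points"][1:]:
            body.add_paragraph().text = point
    prs.save(path)


outline = """PPT Outline:
Slide 1: Analysis of the New Energy Vehicle Industry
- Market Size: ¥50 billion
- Growth Rate: 12%
Slide 2: Top 3 Competitors
- Company A: Market Share 30%
..."""
slides = parse_outline(outline)
print([s["title"] for s in slides])
# render_pptx(slides, "report.pptx")  # requires python-pptx
```

A more robust option is to ask the LLM for JSON (a list of `{title, points}` objects) instead of free text, which removes the need for regex parsing.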

⚙️ Engineering Optimization Suggestions

Parallelization: Steps that do not depend on each other (e.g., separate research queries on market size, competitors, and trends) can run concurrently.
Cache: Cache results of the same industry research for 1 day.
Fault Tolerance: If any agent fails → retry 2 times → escalate to human.
Cost Control: Use a cheaper model (e.g., gpt-3.5-turbo) for drafts and a stronger one (e.g., gpt-4-turbo) for polishing.
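A minimal sketch of the caching and fault-tolerance suggestions, assuming async agent functions like those in the pipeline above. `with_retry` retries a failing agent twice with exponential backoff and then raises `EscalationRequired`, which a coordinator could route to a human. `with_ttl_cache` reuses results for the same input for one day by default. All three names are hypothetical helpers, not LangChain APIs.

```python
import asyncio
import time
from typing import Awaitable, Callable

AgentFn = Callable[[str], Awaitable[str]]


class EscalationRequired(RuntimeError):
    pass


def with_retry(agent: AgentFn, retries: int = 2, delay_s: float = 1.0) -> AgentFn:
    async def wrapped(arg: str) -> str:
        last_error = None
        for attempt in range(retries + 1):  # 1 initial try + `retries` retries
            try:
                return await agent(arg)
            except Exception as exc:
                last_error = exc
                if attempt < retries:
                    await asyncio.sleep(delay_s * (2 ** attempt))  # exponential backoff
        raise EscalationRequired(f"{agent.__name__} failed {retries + 1} times") from last_error
    return wrapped


def with_ttl_cache(agent: AgentFn, ttl_s: float = 86_400.0) -> AgentFn:
    cache: dict[str, tuple[float, str]] = {}  # input → (stored_at, result)

    async def wrapped(arg: str) -> str:
        hit = cache.get(arg)
        if hit and time.monotonic() - hit[0] < ttl_s:
            return hit[1]
        result = await agent(arg)
        cache[arg] = (time.monotonic(), result)
        return result
    return wrapped


calls = {"n": 0}


async def flaky_research(industry: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient API error")  # first two calls fail
    return f"research on {industry}"


# Cache outside retry, so failures are never cached.
research = with_ttl_cache(with_retry(flaky_research, retries=2, delay_s=0.01))
print(asyncio.run(research("New Energy Vehicles")), calls["n"])
# → research on New Energy Vehicles 3
```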

5. Evaluation and Challenges of Multi-Agent Systems

How do we measure whether a multi-agent system is “usable”? Evaluate it along four dimensions:

📊 Evaluation Index System

Dimension	Index	Description	Measurement Method
Effectiveness	Task Success Rate	Whether the final output meets user needs.	Manual evaluation / Automated testing.
Effectiveness	Output Accuracy	Whether data/conclusions are accurate.	Comparison with standard answers.
Effectiveness	Consistency	Whether results are stable across multiple runs.	Variance calculation.
Efficiency	Latency	Total time from input to output.	Timer.
Efficiency	Throughput	Number of tasks processed per unit time.	Stress testing.
Efficiency	Parallelism	Number of concurrently active agents / Total number of agents.	Monitoring logs.
Cost	Total Token Consumption	Total tokens used by all agents calling LLM.	API billing statistics.
Cost	Computational Resource Consumption	CPU/GPU/Memory usage.	System monitoring.
Robustness	Fault Tolerance Rate	Proportion of successful system recovery after agent failure.	Fault injection testing.
Robustness	Recovery Time from Anomaly	Time from failure to normal recovery.	Log analysis.
Robustness	Success Rate of Defense Against Adversarial Attacks	Success rate of resisting malicious inputs/role injections.	Red team testing.

📌 Recommendation: Conduct “fault injection testing” and “stress testing” before going live to ensure stability in the production environment.
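A minimal sketch that computes several indices from the table above based on per-run logs. The `RunLog` fields are assumptions about what a team might record. The `success` flag and optional quality `score` would come from manual review or comparison against reference answers, and the sample runs are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pvariance
from typing import Optional


@dataclass
class RunLog:
    success: bool                  # did the final output meet the user's need?
    latency_s: float               # input-to-output wall time
    tokens: int                    # tokens across all agents' LLM calls
    faults_injected: int = 0       # from fault injection testing
    faults_recovered: int = 0
    score: Optional[float] = None  # optional quality score for consistency checks


def summarize(runs: list[RunLog], window_s: float) -> dict:
    injected = sum(r.faults_injected for r in runs)
    scores = [r.score for r in runs if r.score is not None]
    return {
        "task_success_rate": sum(r.success for r in runs) / len(runs),
        "mean_latency_s": mean(r.latency_s for r in runs),
        "throughput_per_min": len(runs) / window_s * 60,
        "total_tokens": sum(r.tokens for r in runs),
        "fault_tolerance_rate": (sum(r.faults_recovered for r in runs) / injected) if injected else None,
        "score_variance": pvariance(scores) if len(scores) > 1 else None,  # consistency
    }


runs = [
    RunLog(True, 42.0, 12_000, score=8.0),
    RunLog(True, 38.0, 11_500, faults_injected=1, faults_recovered=1, score=7.0),
    RunLog(False, 60.0, 15_000, faults_injected=1, score=5.0),
]
print(summarize(runs, window_s=180.0))
```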

✅ Summary of This Chapter

Multi-agent systems originate from distributed AI, game theory, and organizational modeling, and they have drawn renewed attention with the rise of large language models.
Five architectural patterns: role-based, pipeline, auction-based (grounded in game theory), hierarchical, and swarm intelligence.
Communication protocols are core: structured messages, dialogue state management, synchronization of shared data, and defense against malicious attacks are essential.
Leading frameworks: AutoGen, ChatDev, MetaGPT, CAMEL, AgentScope each have applicable scenarios.
Evaluation requires four dimensions: effectiveness (success rate/accuracy), efficiency (latency/throughput), cost (tokens/resources), and robustness (fault tolerance/attack resistance).
Applications extend beyond content generation, empowering research, finance, robotics, and industrial sectors.
More is not always better: the number of agents should balance complexity against benefit, and 3-5 agents is often a practical sweet spot.