Recently, an interesting event occurred in the AI community.
The protagonists are two major players in the AI field: Anthropic, the company behind Claude, and Cognition AI, which shot to global fame with Devin, billed as the first AI software engineer. Within just a few days of each other, and almost as if by coordination, the two companies published significant articles about “Multi-Agent” systems.
What’s so interesting about that? On the surface, the titles and core viewpoints of the two articles are in head-on opposition.
Anthropic’s article, titled “How We Built Our Multi-Agent Research System,” describes in detail their successful experience building a complex multi-agent system, exuding confidence with a tone of “we’ve got it right, and the results are outstanding.”

On the other hand, Cognition, the company behind Devin, published a more provocative article with a blunt title: “Don’t Build Multi-Agents.” Its core argument is that current multi-agent collaboration is fragile, prone to failure, and the wrong way to build.

See? One says, “We did it this way, and it’s great,” while the other warns, “Don’t do it this way; it will fail.” This is quite dramatic. To an outsider, it looks like a battle of titans. Is it a rivalry over the future development paradigm of AI Agents?
However, if you read the two articles together carefully, you will find that this is not a quarrel; it’s a duet between experts. Cognition is not refuting Anthropic; rather, it spends its entire article elaborating on a “warning paragraph” in Anthropic’s post that most of us overlooked.
What does this mean? In Anthropic’s article, which introduces its success, there is a seemingly inconspicuous but crucial statement:
“… Domains where all agents need to share the same context, or where there are many dependencies between agents, are not a good fit for multi-agent systems. For example, most coding tasks involve far fewer parallelizable sub-tasks than research tasks …”
Did you notice? Anthropic has already put the “ugly truth” on the table: our method works well, but only when the task’s “dependency level” is low, so that it can be broken down into largely unrelated sub-tasks, as research tasks can. Highly interdependent tasks like “writing code,” however, are beyond what this method can handle.
Now look at Cognition’s article. What does it say? It uses the example of “building a Flappy Bird game” to argue, over and over, that two parallel agents (one drawing the background, the other drawing the bird) cannot produce a usable result if they cannot share context. Isn’t that precisely an explanation of why the coding tasks Anthropic warned about don’t work?
Thus, the truth is revealed. The two articles are not contradictory; they are perfectly complementary. Anthropic has drawn a target called “low-dependency, parallelizable tasks” and shown just how accurately it can hit it. Meanwhile, Cognition points to the adjacent area of “high-dependency, tightly coupled” tasks and loudly reminds everyone: “Don’t shoot here; it’s all pitfalls!”
The Dragon-Slaying Method: The Four-Question Dependency Test
So now we know: the key to using multi-agents lies not in how many agents you have but in the “dependency level” of the task itself. Low dependency? Feel free to run multi-agents in parallel and work wonders. High dependency? Whoever tries it will crash.
So, how can we ordinary developers assess the dependency level of a task? We can’t rely on intuition every time.
Don’t worry! By distilling the wisdom of these two top players, we can arrive at a very practical “dragon-slaying method,” which I call the “Four-Question Dependency Test.” Before you attempt to break down a task, ask yourself these four questions:
- The Isolation Test: If you assign sub-tasks A and B to two people locked in separate rooms with no phone calls, can the results they produce be directly combined? If yes (e.g., A writes Tesla’s financial report, B writes GM’s financial report), then it’s low dependency. If no (A draws the background, B draws the character), then it’s high dependency.
- The Order Agnostic Test: Can the execution order of sub-tasks A and B be swapped? Can they start simultaneously? If yes (it doesn’t matter who writes the financial reports first), it’s low dependency. If A must be completed before B (e.g., the database table structure must be built before data can be written), then it’s high dependency.
- The Shared Resource Test: Do these sub-tasks need to modify the same thing? For example, the same file or variable. If they only need to read, that’s fine. But if they both need to modify, it’s like several people in a construction team trying to build the same wall at the same time; it’s bound to cause conflict. Tasks that frequently modify shared resources are high dependency.
- The Integration Cost Test: Is the final result of all sub-tasks simply a matter of “stacking together” (aggregation) or does it require complex “blending together” (composition)? Combining several research reports into one document is aggregation, which has low cost. However, weaving several functions tightly together to form a new feature is composition, which has high cost.
| Test Dimension | Low Dependency (Feel free to use multi-agents) | High Dependency (Don’t use directly) |
| --- | --- | --- |
| Isolation Test | Can be completed independently in isolation | Results conflict and cannot be used directly |
| Order Agnostic Test | Order can be swapped; tasks can run in parallel | There is a strict execution order |
| Shared Resource Test | No modification, or read-only access to shared resources | Frequent reads and writes of the same shared resource |
| Integration Cost Test | Simple result aggregation | Complex result composition |
With this method, we can diagnose any task like an experienced physician, clearly determining whether it is suitable for multi-agent parallel processing.
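To make the checklist easier to apply, here is a minimal sketch in Python of the four questions encoded as a pre-flight check you might run before splitting a task across agents. The names (`DependencyCheck`, `classify`) and the red-flag threshold are my own illustration, not something either article prescribes.

```python
from dataclasses import dataclass

@dataclass
class DependencyCheck:
    """One flag per question of the Four-Question Dependency Test.

    Each flag is True when the sub-tasks FAIL that test, i.e. when the
    answer points toward high dependency.
    """
    needs_shared_context: bool      # Isolation Test: results cannot be combined blindly
    has_strict_ordering: bool       # Order Agnostic Test: A must finish before B starts
    mutates_shared_resources: bool  # Shared Resource Test: sub-tasks write the same file/variable
    needs_composition: bool         # Integration Cost Test: results must be woven together

def classify(check: DependencyCheck) -> str:
    """Return a rough verdict on whether parallel multi-agents are a good fit."""
    red_flags = sum([
        check.needs_shared_context,
        check.has_strict_ordering,
        check.mutates_shared_resources,
        check.needs_composition,
    ])
    if red_flags == 0:
        return "low dependency: parallel multi-agents are a good fit"
    if red_flags == 1:
        return "borderline: parallelize with care and a strong coordinating agent"
    return "high dependency: keep a single agent (or a strictly sequenced pipeline)"
```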
Challenging the Ultimate Boss: How Can AI Elegantly Write Code?
Now that we have this method, we must challenge the most difficult boss—programming.
Run programming through the “Four-Question Test” and you will find it lights up red on almost every question: sub-tasks cannot be isolated, there is a strict execution order, they share the entire codebase, and integration costs are extremely high.
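As a concrete illustration, here is how the two canonical examples from this article look when run through the sketch above (reusing the hypothetical `DependencyCheck` and `classify` helpers); the flag values are simply my reading of the two scenarios.

```python
# Researching Tesla's and GM's financials in parallel (the Anthropic-style task):
research = DependencyCheck(
    needs_shared_context=False,      # each report stands on its own
    has_strict_ordering=False,       # either report can be written first
    mutates_shared_resources=False,  # agents only read public sources
    needs_composition=False,         # the reports are simply aggregated
)

# Building Flappy Bird with one agent on the background and one on the bird
# (Cognition's cautionary example):
coding = DependencyCheck(
    needs_shared_context=True,       # art style and game logic must match
    has_strict_ordering=True,        # the core loop must exist before assets are wired in
    mutates_shared_resources=True,   # both agents touch the same codebase
    needs_composition=True,          # the pieces must be woven into one program
)

print(classify(research))  # low dependency: parallel multi-agents are a good fit
print(classify(coding))    # high dependency: keep a single agent (or a strictly sequenced pipeline)
```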
This perfectly confirms Cognition’s warning and Anthropic’s self-awareness.
So, is multi-agent programming really hopeless?
I can’t claim to have the answer. After all, even giants like Anthropic admit this is a significant challenge today. If I say, “I can do it,” that would be madness.
However, the issue of “communication costs skyrocketing between multiple actors, leading to collaboration failures” is not new. It reminds me of a classic discussion in the history of human software development. Decades ago, when projects were entirely human-led, the industry faced the same dilemma: as the team size increased, the cost of communication and information synchronization exploded exponentially, ultimately leading to project delays or even failures.
At that time, a book regarded as the bible of the computing field—”The Mythical Man-Month”—delved deeply into this issue.

The book proposed a forward-looking solution called the “Surgical Team” model. Its core idea is that to ensure the “conceptual integrity” of software design, a “surgeon” (chief programmer) should lead all core design and coding, supported by a specialized team with clearly defined roles, such as assistants, testers, document editors, and toolsmiths.
To be honest, this model has always been difficult to implement perfectly in the human world. The reason is simple: finding an all-around “surgeon” is too difficult, and maintaining such a top-tier team is too costly.
However, in the AI era, we may see the possibility of reviving this old wisdom. Why? Because certain characteristics of AI might just overcome several of this model’s obstacles.
Could a top-tier LLM become that tireless, emotionless, knowledgeable “surgeon”? And the clearly defined, task-oriented supporting roles could be played by smaller, cheaper, specially optimized AI agents, which makes the model feasible in terms of both cost and efficiency.
Thus, an advanced “AI programming team” is born:
- “Surgeon” AI (Main Control Agent): Responsible for overall architecture design, defining interfaces, and writing the most critical code.
- “Assistant” AI (Review Agent): Responsible for reviewing every line of code written by the “surgeon” and providing optimization suggestions, acting as a reviewer.
- “Tester” AI (Testing Agent): Automatically generates and executes unit tests based on code functionality to find bugs.
- “Documentation” AI (Documentation Agent): Automatically generates clear documentation and comments for the code.
- “Version Manager” AI (Staff Agent): Responsible for managing code versions, merging branches, and recording all operation logs.
Perhaps, through such a structure, AIs will no longer write code chaotically in parallel but will collaborate within a framework much like modern software engineering, with clear divisions of labor and processes (PRs, CI/CD). The aim is to transform a high-dependency, tightly coupled problem into a structured, process-driven engineering problem.
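To make the shape of such a team concrete, here is a minimal orchestration sketch in Python. It covers only the surgeon, reviewer, tester, and documentation roles, and it assumes a hypothetical `call_llm(role_prompt, task)` helper standing in for whatever model client you use; the prompts, loop, and stopping condition are illustrative, not a blueprint from either article.

```python
def call_llm(role_prompt: str, task: str) -> str:
    """Placeholder for your LLM client of choice (possibly a different model per role)."""
    raise NotImplementedError

def surgical_team(feature_request: str, max_rounds: int = 3) -> dict:
    # The "surgeon" owns the architecture and the critical code: one mind,
    # one shared context, so conceptual integrity is never split.
    code = call_llm("You are the lead programmer (surgeon). Write the code.", feature_request)

    for _ in range(max_rounds):
        # Supporting roles never edit the code directly; they only produce
        # feedback, so all writes stay funneled through the surgeon.
        review = call_llm("You are a code reviewer (assistant). Review this code.", code)
        tests = call_llm("You are a test engineer. Write and run unit tests for this code.", code)
        if "no issues" in review.lower() and "all tests pass" in tests.lower():
            break
        code = call_llm(
            "You are the lead programmer. Revise your code to address this feedback.",
            f"{feature_request}\n\nReview:\n{review}\n\nTest report:\n{tests}",
        )

    docs = call_llm("You are a technical writer. Document this code.", code)
    return {"code": code, "docs": docs}
```

The key design choice is that only the surgeon ever writes code; every other agent produces feedback that flows back into the surgeon’s single, coherent context, which is how the structure sidesteps the shared-context failure from the Flappy Bird example.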
So, back to the original question: did Anthropic and Cognition clash? No. They are simply revealing the next stage of AI Agent development from different angles:
We are transitioning from an era of “making individual AIs smarter” to an era of “how to organize a group of AIs to be smarter.”