2024-08-12
New Intelligence Report
Editor: Peach
[Introduction] Have you ever wondered whether, one day, an army of AI agents might shoulder a company's full responsibilities while humans are reduced to a supporting role?
Zuckerberg firmly believes that "in the future, there will be more AI entities in the world than humans."
So, what if these AIs also have corporate culture?
Would they, like humans, be divided into AIs that hold decision-making power and AIs that do the hard work?
A few months ago, it was revealed that OpenAI had internally defined a five-level roadmap to AGI, the top level of which is L5, Organizer: AI that can do the work of an organization.
This may be the organizational chart of the company in the future.
Such an organization becomes possible through the collaboration of multiple agents.
Previously, a study showed that a system of more than 30 AI agents outperformed simple LLM calls on almost every task, while also reducing hallucinations and improving accuracy.
Paper address: https://arxiv.org/pdf/2402.05120
However, how should multiple intelligent agents actually collaborate?
While exploring ways to improve AI’s performance on software engineering tasks, Alex Sima had an epiphany:
What would happen if the interactions between AI agents were institutionalized, similar to the "organizational chart" of a tech giant?
Alex then let AI agents take over the org charts of six technology giants (Amazon, Google, Microsoft, Apple, Meta, and Oracle) to see how they would collaborate.
Let’s take a look at a picture first and get a feel for it.
Key Takeaways
Here are some key takeaways from Alex’s organization of AI agents into a company structure similar to Apple, Microsoft, Google, etc.:
- Companies with multiple “competing” teams (i.e. competing to produce the best end product), such as Microsoft and Apple, outperform centralized hierarchies.
- Systems with single points of failure (for example, one leader making the important decisions), like Google, Amazon, and Oracle, perform poorly.
- The organizational structure of large technology companies has a modest but significant impact on problem-solving capabilities.
AI Agents and Technology Giants
Previous attempts to improve performance on benchmarks such as SWE-bench by simply increasing the number of AI agents have not achieved significant results.
This shows that simply relying on increasing quantity cannot solve the problem.
So, what other ways are there to make AI agents better at software engineering?
Three weeks ago, Alex came across an article by James Huckle on "Conway's Law": the idea that software and product architecture is destined to reflect the structure of the organization that created it.
James showed an illustration revealing the dramatic organizational structures of Amazon, Google, Facebook, Microsoft, Apple, and Oracle, and offered an idea:
Just like humans in large tech companies, multi-agent communication structures may shape problem-solving approaches.
Alex was inspired and decided to test James' hypothesis on a SWE-bench instance.
Experimental setup
Alex organized AI agents into six different corporate structures and evaluated each on the "mini" subset of SWE-bench-lite, which contains 13 instances.
In building these six organizations, he designed each multi-agent structure around a few core observations:
Amazon
There is a binary tree of "managers" at the top level.
To replicate this structure, Alex uses a large number of agents that perform codebase searches, and a single agent that ultimately performs codebase updates.
Google
Similar to Amazon's tree structure, but with more connections between the middle layers.
Alex replicates this by aggregating the results of all agents within a single layer and passing them to the agents in the next layer.
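The layer-by-layer aggregation described above can be sketched roughly as follows. This is only an illustration, not Alex's implementation: the `Agent` type, `run_layered`, `make_agent`, and the agent names are all hypothetical, and each "agent" is a plain function standing in for an LLM call.

```python
from typing import Callable, List

# Assumption: an "agent" is modeled as a function from a text context to a
# text result. A real agent would call an LLM and use tools.
Agent = Callable[[str], str]


def run_layered(layers: List[List[Agent]], task: str) -> str:
    """Run each layer on the aggregated output of the previous layer."""
    context = task
    for layer in layers:
        # Every agent in the layer sees the same aggregated context.
        outputs = [agent(context) for agent in layer]
        # Aggregate the whole layer into one message for the next layer.
        context = "\n".join(outputs)
    return context


def make_agent(name: str) -> Agent:
    # Hypothetical stand-in: wraps everything it received in its own name,
    # so the final string shows how information flowed through the layers.
    def agent(context: str) -> str:
        joined = " + ".join(context.splitlines())
        return f"{name}({joined})"
    return agent


layers = [
    [make_agent("search_a"), make_agent("search_b")],  # codebase search
    [make_agent("review")],                            # middle layer
    [make_agent("update")],                            # single final editor
]
result = run_layered(layers, "fix bug")
print(result)
```

In this toy run the final string nests the earlier outputs, which makes it easy to see that every agent in one layer feeds every agent in the next.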
Meta (Facebook)
Lacks a hierarchical structure, but is still a mesh with many connections between agents.
Alex modifies the original agent design by adding the possibility of transitioning between different agents.
Microsoft
Emphasis on competing teams, each with its own hierarchy.
Essentially, Alex restructured the Amazon setup (with fewer agents) and used a vector similarity voting method to select the "best" solution from three separate runs, each with a slightly tweaked hierarchy.
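The article does not say how the vector similarity vote is computed. One plausible reading is to pick the candidate patch that is, on average, closest to all the others. The sketch below follows that reading under loud assumptions: `embed` uses a bag-of-words count vector in place of a real embedding model, and `similarity_vote` and the sample strings are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Assumption: a bag-of-words count vector stands in for a real embedding.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Counter returns 0 for missing keys, so absent tokens contribute nothing.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_vote(candidates: List[str]) -> int:
    """Return the index of the candidate most similar to all the others."""
    vectors = [embed(c) for c in candidates]
    scores = [
        sum(cosine(v, w) for j, w in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    # Ties resolve to the earliest candidate.
    return max(range(len(candidates)), key=scores.__getitem__)


# Three hypothetical patch summaries, one per run.
runs = ["fix parser", "fix parser cache", "parser cache reset"]
best = similarity_vote(runs)
print(runs[best])
```

The middle candidate wins here because it overlaps with both of the others, which is the intuition behind treating the "most central" output as the consensus.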
Apple
Many small competing teams, each with its own minimal structure.
Alex used the same “best solution” approach as Microsoft, but performed more runs without the agent hierarchy (with different transformations in each run).
Oracle
There are two different teams, a larger "legal" binary tree and a smaller engineering tree.
Alex interprets the legal team as agents that search the codebase and retrieve key context, while the engineering team consists of agents that actually write the code.
The structure of the two teams is similar to that of Amazon, with a single agent at the top coordinating the flow of information between "legal" and "engineering."
Evaluation results
To evaluate each organization's set of patches, Alex used the SWE-bench evaluation.
The results are as follows:
Organizational Chart Performance Analysis
Here are some of the author’s observations on how different company structures affect performance:
- Competitive teams increase chances of success.
The two best performers (Microsoft and Apple) have multiple teams competing to solve the problem, while the others seem to have just one huge team producing a single patch.
Multiple teams allow for more diversity in problem-solving approaches, increasing the probability of problem resolution.
- Structures with single points of failure perform poorly.
When we talk about single points of failure, we are referring to companies where senior managers/agents can completely change operational outcomes (such as Google, Amazon, and Oracle).
A common problem when coordinating multiple agents is the failure of a single agent: one agent that goes wrong can redirect the whole team's problem-solving strategy.
Companies with single points of failure are susceptible to these problems.
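One way to make "single point of failure" concrete is to treat an organization as a directed graph of agents and ask whether removing one agent cuts the task off from every final patch. This graph framing is an assumption introduced here for illustration; the article does not define the term formally, and the graphs and function names below are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # node -> nodes that receive its output


def reachable(graph: Graph, start: str, removed: str) -> Set[str]:
    """Nodes reachable from start when `removed` is taken out of the graph."""
    if start == removed:
        return set()
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def single_points_of_failure(graph: Graph, source: str, outputs: Set[str]) -> Set[str]:
    """Agents whose removal leaves no path from the task to any final patch."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {source} - outputs
    return {
        n for n in candidates
        if not (reachable(graph, source, n) & outputs)
    }


# Hypothetical hierarchy: two searchers report to one lead who writes the patch.
hierarchy = {
    "task": ["search_a", "search_b"],
    "search_a": ["lead"],
    "search_b": ["lead"],
    "lead": ["patch"],
}
# Hypothetical competing teams: each team produces its own patch.
competing = {
    "task": ["team1", "team2"],
    "team1": ["patch1"],
    "team2": ["patch2"],
}
print(single_points_of_failure(hierarchy, "task", {"patch"}))
print(single_points_of_failure(competing, "task", {"patch1", "patch2"}))
```

Under this framing the hierarchy has one critical agent (the lead), while the competing-teams layout has none, which matches the article's informal distinction between the two groups of companies.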
Additionally, the two best-performing companies, Microsoft and Apple, happen to be the two largest technology companies in the world by market capitalization.
It turns out that the organizational structures that seem to work best in the real world also work best for AI agents.
Screenshot from CompaniesMarketCap, July 25, 2024
Thoughts on the progress of SWE-bench
The results for the different company structures are about what one would expect on this mini benchmark.
Overall, it seems that in a complex task like software engineering, adding more agents, or changing the way these agents are organized, only leads to marginal performance improvements.
While the paper "More Agents Is All You Need" found a sizeable improvement in accuracy (about 20%), performance clearly flattens out after 30 agents on the GSM8K (grade-school math) test.
The study also found that overly complex tasks (such as those in SWE-bench) may exceed the model's reasoning capabilities, resulting in diminishing performance gains.
Alex also validated this finding in SIMA, which showed at best an improvement of 2-3% over the base architecture (using more than 40 agents).
He expects this small improvement to be consistent in other non-multi-agent architectures.
Alex argues that greater progress on benchmarks requires improving the agents' actual reasoning abilities, or the strategies and methods they can adopt (or are given) to solve software problems.
This could be achieved either through a more powerful base model (such as GPT-5) or by giving the agents a wider range of tools.
It's the same as running a company.
The bottom line is, if you don’t hire smarter employees, or give them better resources, their output won’t improve no matter how you organize them or how many people you have.
Admittedly, performance on 13 instances may be far from performance on the full benchmark.
Even so, the difference on this mini subset alone is large enough to be worth noting (an increase of ~50% from Google to Apple).
The underlying models and tools may be the limiting factor in agent-based software engineering, but as the underlying models improve, agent communication structures (whether modeled on corporate organizations or not) are definitely worth testing.
As James Huckle said, this concept may become a "key hyperparameter" in the design of AI agents, and different organizational structures may be better suited for different tasks.
References:
https://alexsima.substack.com/p/ai-multi-agents-with-corporate-structures