Topping the list of open-source AI software engineers, UIUC's agentless solution solves real programming problems

2024-07-15



The authors of this paper are from Professor Zhang Lingming's team at the University of Illinois at Urbana-Champaign (UIUC): Steven Xia, a fourth-year doctoral student working on LLM-based automated program repair; Deng Yinlin, a fourth-year doctoral student working on LLM-based code generation; and Soren Dunn, a research intern and currently a junior at UIUC. Professor Zhang Lingming is an associate professor in the Department of Computer Science at UIUC, whose research spans software engineering, machine learning, and large language models for code.

For more details, please visit Professor Zhang's personal homepage: https://lingming.cs.illinois.edu/

Since Devin (billed as the first fully autonomous AI software engineer) was announced, designing AI agents for software engineering has become a research focus. More and more agent-based automated software engineers have been proposed, achieving impressive performance on the SWE-bench dataset and automatically fixing many real GitHub issues.

However, complex agent systems bring extra overhead and uncertainty. Do we really need such complex agents to solve GitHub issues? Can solutions that do not rely on agents approach their performance?

Motivated by these two questions, Professor Zhang Lingming's team at UIUC proposed OpenAutoCoder-Agentless, a simple, efficient, and fully open-source agentless solution that can solve a real GitHub issue for an average of only $0.34. Agentless attracted more than 300 GitHub stars within a few days and ranked among the top three of DAIR.AI's weekly hottest ML papers.



Paper: AGENTLESS: Demystifying LLM-based Software Engineering Agents

Paper address: https://huggingface.co/papers/2407.01489

Open source code: https://github.com/OpenAutoCoder/Agentless

“The Agentless framework outperformed all open source agent solutions and nearly reached the top of SWE Bench Lite (27%),” said Leo Boytsov, AWS Research Scientist. “And it beat all open source solutions at a significantly lower cost. The framework uses a hierarchical query approach (looking for files, classes, functions, etc. by asking questions to the LLM) to determine patch locations. While leveraging the LLM, it does not allow the LLM to make planning decisions.”



Agentless is an automated approach to solving software development problems that uses a simple two-phase process to locate and repair bugs in a code base. In the localization phase, Agentless hierarchically narrows down to suspicious files, then classes/functions, and finally specific edit locations. In the repair phase, it uses a simple diff format (borrowed from the open-source tool Aider) to generate multiple candidate patches, then filters and ranks them.
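For a concrete picture, here is a minimal Python sketch of this two-phase idea. It is not the authors' implementation: the prompts, the `ask_llm` helper, and the use of the OpenAI Python client with GPT-4o are simplifying assumptions, and the real pipeline formats repository structure, code skeletons, and diffs far more carefully.

```python
# Minimal sketch of an Agentless-style two-phase pipeline (localize, then repair).
# Prompts and helper names are illustrative assumptions, not the authors' code.
from openai import OpenAI  # assumes the OpenAI Python client (>= 1.0) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_llm(prompt: str) -> str:
    """One chat-completion call; the paper reports results with GPT-4o."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def localize(issue: str, repo_structure: str) -> str:
    """Phase 1: hierarchically narrow from files to concrete edit locations."""
    files = ask_llm(f"Issue:\n{issue}\n\nRepository structure:\n{repo_structure}\n\n"
                    "List the files most likely to need changes.")
    elements = ask_llm(f"Issue:\n{issue}\n\nCandidate files:\n{files}\n\n"
                       "List the classes and functions most likely to need changes.")
    return ask_llm(f"Issue:\n{issue}\n\nCandidate elements:\n{elements}\n\n"
                   "Give the exact line ranges that should be edited.")


def repair(issue: str, code_context: str, n_samples: int = 4) -> list[str]:
    """Phase 2: sample several candidate patches in a search/replace diff format."""
    prompt = (f"Issue:\n{issue}\n\nRelevant code:\n{code_context}\n\n"
              "Propose a fix as SEARCH/REPLACE edit blocks.")
    return [ask_llm(prompt) for _ in range(n_samples)]
```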



The researchers compared Agentless with existing AI software agents, including state-of-the-art open-source and commercial/closed-source projects. Surprisingly, Agentless outperformed all existing open-source agents at a lower cost: it solved 27.33% of the problems, the highest among open-source solutions, at an average of only $0.29 per solved problem and about $0.34 on average across all problems (solved and unsolved).



Not only that, Agentless still has room to improve. When all generated patches are considered, Agentless can solve 41% of the problems, an upper bound suggesting significant headroom in the patch ranking and selection stages. In addition, Agentless solves some problems that even the best commercial tool (Alibaba Lingma Agent) cannot, suggesting that it can complement existing tools.
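The 41% upper bound comes from the gap between the patches Agentless generates and the single patch it ultimately submits. As a rough illustration of how selecting among candidates can work, the snippet below does a majority vote over naively normalized patches; this is a deliberate simplification, not the authors' actual filtering and ranking procedure.

```python
# Hedged sketch: pick the most common candidate patch after cosmetic normalization.
import re
from collections import Counter


def normalize(patch: str) -> str:
    """Collapse cosmetic differences so equivalent patches compare equal (naive)."""
    lines = []
    for line in patch.splitlines():
        line = line.split("#", 1)[0]     # drop trailing comments (ignores '#' in strings)
        line = re.sub(r"\s+", "", line)  # drop all whitespace
        if line:
            lines.append(line)
    return "\n".join(lines)


def select_patch(candidates: list[str]) -> str:
    """Return one representative of the most frequent normalized patch."""
    counts = Counter(normalize(p) for p in candidates)
    best_norm, _ = counts.most_common(1)[0]
    return next(p for p in candidates if normalize(p) == best_norm)


patches = [
    "def add(a, b):\n    return a + b  # fixed\n",
    "def add(a, b):\n    return a+b\n",
    "def add(a, b):\n    return a - b\n",
]
print(select_patch(patches))  # picks the majority (addition) variant
```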



Analysis of the SWE-bench Lite dataset

The researchers also performed a manual inspection and detailed analysis of the SWE-bench Lite dataset.

The study found that 4.3% of the problems in SWE-bench Lite include the complete answer, i.e., the correct fix patch, directly in the issue description, and another 10% describe the exact steps of the correct solution. This suggests that some problems in SWE-bench Lite may be easier than they appear.

In addition, the research team observed that 4.3% of the problems include user-proposed solutions or steps in the issue description that are inconsistent with the developers' actual patches. This further reveals a potential problem with the benchmark: such misleading suggestions may cause AI tools to generate incorrect fixes by simply following the issue description.

In terms of issue-description quality, the researchers observed that although most tasks in SWE-bench Lite contain sufficient information, and many also provide failing examples to reproduce the errors, 9.3% of the problems still lack sufficient information. For example, a task may require implementing a new function or adding an error message, but the issue description does not give the specific function name or the exact error-message string. This means that even if the underlying functionality is implemented correctly, the tests will fail unless the function name or error string matches exactly.
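To make the last point concrete, here is a toy illustration (not an actual SWE-bench task): if the issue never specifies the error-message wording, a functionally correct patch can still fail a hidden test that pins the exact string.

```python
# Toy example: correct behaviour, "wrong" error-message wording.
import re


def set_age(age: int) -> int:
    # The hypothetical issue only says "raise an error for negative ages",
    # so this wording is a perfectly reasonable, functionally correct fix.
    if age < 0:
        raise ValueError("age must not be negative")
    return age


# A hidden benchmark test may pin exact wording the issue never gave,
# so the correct behaviour still fails the check:
try:
    set_age(-1)
except ValueError as exc:
    print(bool(re.search("age cannot be negative", str(exc))))  # False -> test would fail
```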



Ofir Press, a Princeton researcher and one of the authors of SWE-Bench, confirmed their findings: "Agentless did a nice manual analysis of SWE-bench Lite. They suggest that the theoretical maximum score on Lite could be 90.7%. I suspect the actual upper limit is probably lower (around 80%). Some questions have insufficient information, others are too rigorously tested."



SWE-bench Lite-S: a stricter, filtered subset of problems

To address these issues, the researchers proposed a stricter subset of problems, SWE-bench Lite-S (252 problems). Specifically, problems whose descriptions contain the exact patch, contain misleading solutions, or do not provide enough information were excluded from SWE-bench Lite (300 problems). This removes unreasonable problems and standardizes the benchmark's difficulty. Compared with the original SWE-bench Lite, the filtered benchmark more accurately reflects the true capabilities of automated software development tools. A sketch of the filtering idea follows.
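The snippet below is a hedged sketch of that filtering step; the field names and flags are illustrative annotations, not the authors' actual labeling schema.

```python
# Illustrative filtering of benchmark problems into a stricter subset.
from dataclasses import dataclass


@dataclass
class Problem:
    instance_id: str
    contains_exact_patch: bool      # issue text already includes the ground-truth fix
    has_misleading_solution: bool   # user-suggested fix contradicts the real patch
    missing_key_information: bool   # e.g. required function name or error string absent


def build_lite_s(problems: list[Problem]) -> list[Problem]:
    """Keep only problems solvable from the issue description alone."""
    return [
        p for p in problems
        if not (p.contains_exact_patch
                or p.has_misleading_solution
                or p.missing_key_information)
    ]
```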

Conclusion

Although agent-based software development is very promising, the authors believe it is time for the technical and research community to pause and reflect on its key design and evaluation methods instead of rushing to release more agents. The researchers hope Agentless can help reset the baseline and direction of future software engineering agents.