OpenDevin has released a technical report, a must-read for large model agent developers

2024-08-02




Machine Heart Report

Editors: Chen Chen, Zenan

A popular general-purpose platform for large-model agents.

In March this year, Devin, billed as the "world's first AI software engineer," caused a stir in the AI community. Unlike earlier AI programming assistants, Devin is not merely an auxiliary coding tool: it can independently complete an entire development project end to end.



The arrival of Devin demonstrated the power of large-model agents. Many open-source projects soon appeared attempting to replicate it, and among them OpenDevin stood out and attracted the most attention.

OpenDevin is a platform for developing general-purpose agents that interact with the world through software. Its features include:

Interaction mechanisms between large-model agents, interfaces, and environments;

A sandboxed operating system plus web-browser environment available to agents;

An interface for creating and executing code;

Multi-agent support;

An evaluation framework.

Currently, OpenDevin has received more than 29,000 stars on GitHub.



Recently, the OpenDevin team released a technical report on the tool.



Report address: https://arxiv.org/pdf/2407.16741

In the technical report, the authors of OpenDevin, scholars from the University of Illinois Urbana-Champaign, Carnegie Mellon University, and other institutions, describe OpenDevin in detail: a community-driven platform designed for developing general and specialized AI agents that interact with the world through software.

More importantly, OpenDevin is not just a conceptual framework; it also includes comprehensive, ready-to-use implementations of agents, environments, and evaluations. As of the report's writing, OpenDevin includes an agent hub in which more than 10 agents have been implemented, including a powerful general-purpose agent based on the CodeAct architecture along with specialized agents for web browsing and code editing. Users interact with agents through a chat interface that visualizes the agent's current operations and allows real-time feedback. In addition, the evaluation framework currently supports 15 benchmarks for assessing agent performance.

OpenDevin Architecture

In the paper, the authors describe OpenDevin from four aspects: (1) how agents are defined and implemented; (2) how action execution produces observations; (3) how commonly used agent skills are managed and extended; and (4) how multiple agents are combined to solve tasks.



How to define and implement an agent

An agent perceives the state of the environment and generates actions to perform while solving a user-specified task.

State and event stream. In OpenDevin, a state is a data structure that encapsulates all relevant information about an agent performing a task. A key component of this state is the event stream, a chronological record of past actions and observations.
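The idea above can be sketched in a few lines of Python. This is a hypothetical minimal sketch, not OpenDevin's actual implementation: the class and field names (`State`, `history`, `add`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Action:
    """Something the agent does (minimal stand-in class)."""
    content: str

@dataclass
class Observation:
    """Something the agent perceives (minimal stand-in class)."""
    content: str

# The event stream interleaves both kinds of events.
Event = Union[Action, Observation]

@dataclass
class State:
    """Encapsulates what the agent knows about the task so far."""
    # The event stream: past actions and observations in chronological order.
    history: List[Event] = field(default_factory=list)

    def add(self, event: Event) -> None:
        self.history.append(event)

# Usage: actions and the observations they trigger accumulate in order.
state = State()
state.add(Observation("user: fix the failing test"))
state.add(Action("run `pytest -x`"))
state.add(Observation("1 failed: test_parse"))
```

Keeping the full history in one ordered stream lets the agent condition its next decision on everything that has happened so far.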

Actions. Inspired by CodeAct, OpenDevin connects agents to the environment through a core set of actions: IPythonRunCellAction and CmdRunAction let agents execute arbitrary Python code and bash commands inside a sandbox environment (e.g., a securely isolated Linux operating system), while BrowserInteractiveAction lets agents interact with a web browser.

Observation. An observation describes a change in the environment that the agent perceives. It may or may not be caused by the agent's own actions: it can be (1) a natural-language instruction from the user, or (2) the result of the agent's previous action (e.g., code-execution output).
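The action-to-observation round trip can be illustrated with the CmdRunAction case. This is a hedged sketch: the `execute` function and `CmdOutputObservation` class are assumptions for illustration, and the command runs directly via `subprocess` rather than inside the sandbox the report describes.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class CmdRunAction:
    """Ask the environment to run a bash command."""
    command: str

@dataclass
class CmdOutputObservation:
    """What the agent perceives after the command runs."""
    content: str
    exit_code: int

def execute(action: CmdRunAction) -> CmdOutputObservation:
    # In OpenDevin the command would run inside an isolated sandbox;
    # here we run it directly only to show the action -> observation cycle.
    proc = subprocess.run(
        action.command, shell=True, capture_output=True, text=True
    )
    return CmdOutputObservation(content=proc.stdout, exit_code=proc.returncode)

# Usage: the observation feeds back into the agent's event stream.
obs = execute(CmdRunAction("echo hello"))
```

The same pattern generalizes to the other core actions: each action type has a corresponding observation type carrying the result.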

Implementing new agents. The agent abstraction is designed to be simple yet powerful, allowing users to easily create and customize agents for a variety of tasks. At its core is the step function, which takes the current state as input and generates an appropriate action according to the agent's logic. Figure 2 shows simplified example code for the agent abstraction.
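A sketch in the spirit of that abstraction is shown below. This is not the code from Figure 2; the class names (`Agent`, `EchoAgent`) and the string-based state are simplifying assumptions. The point is the contract: subclasses implement only `step(state) -> action`.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class State:
    """Simplified event stream: a list of message strings."""
    history: List[str] = field(default_factory=list)

class Agent:
    """Hypothetical base class: a new agent only needs to supply step()."""
    def step(self, state: State) -> Optional[str]:
        raise NotImplementedError

class EchoAgent(Agent):
    """Toy agent that responds to the most recent event in the stream."""
    def step(self, state: State) -> Optional[str]:
        if not state.history:
            return None  # nothing to react to yet
        # Decide the next action from the current state: the core contract.
        return f"reply to: {state.history[-1]}"

# Usage: the controller calls step() repeatedly, appending each
# resulting action and observation back onto the state.
state = State(history=["user: hello"])
action = EchoAgent().step(state)
```

Because all task context lives in the state, the same controller loop can drive any agent that honors this interface.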



Observing the results of action execution

The Agent Runtime provides agents with an action space comparable to that of human software developers, enabling OpenDevin to handle a variety of software-development and web-based tasks, including complex development workflows, data-analysis projects, and web-browsing tasks. It allows agents to access a bash terminal to run code and command-line tools, leverage Jupyter notebooks to write and execute code on the fly, and interact with a web browser to perform web-based tasks (e.g., information search).

Scalable Agent-Computer Interface

The authors built the AgentSkills library, a toolbox designed to enhance agent capabilities by providing utilities that are not easily accessible through basic bash commands or Python code.
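To make the idea concrete, here is a hedged sketch of what an AgentSkills-style utility might look like. The function name `edit_file` and its behavior are assumptions for illustration, not the library's actual API; the point is that a single well-described helper replaces a fragile chain of `sed`/`grep` commands.

```python
import os
import tempfile
from pathlib import Path

def edit_file(path: str, old: str, new: str) -> str:
    """Hypothetical skill: replace a snippet in a file and report the result.

    Doing this reliably with raw bash means quoting-sensitive sed invocations;
    a skill wraps it in one call with a readable success/error message the
    agent can act on.
    """
    p = Path(path)
    text = p.read_text()
    if old not in text:
        return f"ERROR: `{old}` not found in {path}"
    p.write_text(text.replace(old, new, 1))
    return f"Replaced first occurrence of `{old}` in {path}"

# Usage: fix a typo in a throwaway source file.
tmp = tempfile.NamedTemporaryFile("w", suffix=".py", delete=False)
tmp.write("print('helo')\n")
tmp.close()
msg = edit_file(tmp.name, "helo", "hello")
fixed = Path(tmp.name).read_text()
os.unlink(tmp.name)
```

Skills like this also give the model a stable, documented interface, so the agent does not have to re-derive shell incantations at every step.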

Multi-agent interaction

OpenDevin allows multiple agents to interact. To achieve this, the authors use a special action type, AgentDelegateAction, which allows an agent to delegate a specific subtask to another agent.
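Delegation can be sketched as a routing step on a special action type. This is a minimal illustration under assumptions: the registry dict, the `handle` function, and the stubbed `BrowsingAgent` are hypothetical; only the AgentDelegateAction concept comes from the report.

```python
from dataclasses import dataclass

@dataclass
class AgentDelegateAction:
    """Hand a subtask off to another agent by name."""
    agent: str  # name of the agent to delegate to
    task: str   # natural-language description of the subtask

class BrowsingAgent:
    """Stub: a real browsing agent would drive a web browser."""
    def run(self, task: str) -> str:
        return f"searched the web for: {task}"

# Hypothetical registry mapping agent names to instances.
AGENTS = {"BrowsingAgent": BrowsingAgent()}

def handle(action: AgentDelegateAction) -> str:
    """Route a delegated subtask to the named agent and return its result."""
    return AGENTS[action.agent].run(action.task)

# Usage: a general agent delegates a web search to a specialist,
# and the result comes back as an observation.
result = handle(AgentDelegateAction(agent="BrowsingAgent",
                                    task="find the arXiv id"))
```

From the delegating agent's perspective, the subtask's result arrives like any other observation, so specialist agents compose cleanly with the general one.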

Evaluation

This section compares OpenDevin (abbreviated OD in the experimental results below) with open-source, reproducible baseline methods. The 15 benchmarks cover tasks such as software engineering and web browsing.



Table 3 shows that although the OpenDevin agent may not achieve the best performance in every category, it is designed with generality as the priority.



Table 4 reports the results of the agent on software engineering benchmarks.



In particular:

SWE-bench is designed to evaluate an agent's ability to solve GitHub issues, such as bug reports or feature requests. As shown in Table 4, the latest version, CodeActAgent v1.8 based on claude-3.5-sonnet, resolves 26% of the issues, leading other open-source agents dedicated to software development.

HumanEvalFix. OpenDevin's CodeActAgent successfully fixes 79.3% of the bugs in the Python split, significantly outperforming all non-agent methods and nearly doubling the performance of StarCoder2-15B.

The GPT-4o-based OpenDevin agent achieved the highest success rate of 76.47% on ML-Bench, outperforming SWE-Agent (42.64%).

Gorilla APIBench examines the ability of an agent to use an API. OpenDevin using GPT-4o achieved a success rate of 36.4%, outperforming a baseline that was not specifically fine-tuned for API calls.

ToolQA evaluates an agent's ability to use external tools. OpenDevin with GPT-4o shows the highest performance among all baselines. The agent performs better on tasks involving CSV and database tools, but has room for improvement on math and calculator tool usage.

Table 5 reports the evaluation results on the web browsing benchmark.



Table 6 reports the results of various auxiliary benchmarks.



Among them, GAIA evaluates an agent's ability to solve general tasks. The agent scores 32.1 on GAIA, a significant improvement over the original AutoGPT.

GPQA evaluates an agent's ability to coordinate tool use when solving challenging graduate-level problems; the results are shown in Tables 6 and 7. OpenDevin's integrated support for multiple tools and web search enables the agent to better solve such complex multi-step problems.



For more results, please refer to the original paper.