
Control phones and computers at the same time: a cross-system agent evaluation benchmark with 100 tasks is now available

2024-08-14


AIxiv is the column in which Machine Heart publishes academic and technical content. In the past few years, the AIxiv column has received more than 2,000 submissions, covering top laboratories at major universities and companies around the world, and has effectively promoted academic exchange and dissemination. If you have excellent work to share, you are welcome to submit it or contact us for coverage. Submission email: [email protected]; [email protected]

The cross-platform multimodal agent benchmark CRAB is led by the CAMEL AI community and developed by researchers from Oxford, Stanford, Harvard, KAUST, Eigent AI, and other institutions. The CAMEL framework developed by the CAMEL AI community is the earliest open-source multi-agent project based on large language models, and most community members are researchers and engineers with extensive research and practical experience in the agent field.

AI agents are one of the most attractive research directions in the current large language model community. Users only need to state their needs; the agent framework can then schedule multiple LLMs and support multiple agents that complete the given task collaboratively or competitively.

Today, agents are increasingly combined with multimodal language models (MLMs) and can execute tasks in a variety of graphical user interface (GUI) environments, including the web, desktops, and smartphones. However, current benchmarks for evaluating such agents still have many limitations, such as the complexity of building tasks and test environments and the narrowness of their evaluation metrics.

To address these problems, this paper proposes CRAB, a new cross-environment agent benchmarking framework. CRAB adopts a graph-based, fine-grained evaluation method and provides efficient tools for constructing tasks and evaluators. Based on the CRAB framework, the research team also built a cross-platform test dataset, CRAB Benchmark-v0, which covers 100 tasks spanning PC and smartphone environments, including traditional single-platform tasks and complex cross-platform tasks that must be completed by operating multiple devices simultaneously.



  • Paper title: CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
  • Paper address: https://arxiv.org/abs/2407.01511
  • Code repository: https://github.com/camel-ai/crab

The authors selected four popular multimodal models for preliminary experiments. The results show that a single-agent structure using GPT-4o as the reasoning engine achieved the highest test-point completion rate, 35.26%.

Introduction

As a new agent evaluation benchmark framework, CRAB (Cross-environment Agent Benchmark) is mainly used to evaluate the performance of agents based on multimodal language models (MLMs) on cross-environment tasks. CRAB simulates real-world scenarios in which a human user operates multiple devices simultaneously to complete a complex task. As shown in the demo, CRAB can evaluate an agent that simultaneously operates an Ubuntu desktop system and an Android phone to complete the process of sending a message.



Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650930230&idx=5&sn=057238b4b5ba7a27cc76ce2b4ea89253&chksm=84e43848b393b15e150392aa0315c8dc9771cff17a4624e665eb5e5345bcbf780b7fd2844134&token=2010422951&lang=zh_CN#rd

Imagine an agent that could accurately operate a computer and a mobile phone at the same time according to human instructions: many tedious software operations could be delegated to the agent, improving overall work efficiency. To achieve this goal, we need a more comprehensive and realistic cross-platform testing environment for agents, one that in particular supports operating multiple devices simultaneously and provides a sufficient evaluation feedback mechanism. The CRAB framework attempts to solve the following practical problems:

  • Cross-environment task evaluation: Existing benchmarks usually focus on a single environment (such as the web, Android, or a desktop operating system) [1][2][3][4], ignoring the cross-device collaboration scenarios common in the real world. CRAB encapsulates the interaction with a device or application as an environment; by supporting multi-environment tasks, it gives the agent a richer operating space that is closer to real application scenarios.
  • Fine-grained evaluation methods: Traditional evaluation methods either check only whether the final goal is reached (goal-oriented) or strictly compare operation trajectories (trajectory-oriented) [1][2][3]. Both have limitations and cannot fully reflect agent performance. CRAB proposes a graph-based evaluation method that provides fine-grained metrics and accommodates multiple valid paths to task completion.
  • Task construction complexity: As tasks grow more complex, manually constructing tasks and evaluators becomes increasingly difficult. CRAB proposes a subtask-composition method that simplifies the construction of cross-environment tasks.
  • Evaluation of agent system structures: The paper also studies how different agent system structures (single agent, multi-agent with functional division of labor, multi-agent with environmental division of labor) affect task completion, providing an empirical basis for designing more efficient agent systems.



The table above compares the CRAB framework proposed in this paper with existing agent benchmark frameworks. Compared with other benchmarks, CRAB supports cross-platform operation of computers and mobile phones at the same time and can simulate more realistic usage scenarios.

Many netizens gave high praise to CRAB.

Some joked that AGI has been achieved, because a large language model agent (tested on CRAB) has learned how to exit Vim.



"Can you exit Vim?" This question is often a joke in the programming or technical community, because Vim may be difficult for novices to exit, especially when they are not familiar with Vim's operating mode. (Contribute an emoticon package here)



Some find it hard to believe that an agent can complete the sequence of tasks "check the calendar, open Vim, enter insert mode, type the event list, exit insert mode, and save with :wq".

Some netizens concluded that the next generation of robotic process automation (RPA) will look more like "please help me complete the following tasks", with no need to record every step only to have the automation break after running for a few days.





Some people also mentioned that the Graph Evaluator in CRAB is a very smart way to handle the state of the agent in the environment.



Some even praised CRAB as the future of AI PCs, believing that it is the perfect combination of LLM with PCs and mobile devices. "It is a RabbitOS-like AI that enables existing PCs and mobile devices to have AI capabilities. CRAB's benchmark allows for testing the effectiveness and practicality of multimodal language model agents in the real world."







Each node in the Graph of Decomposed Tasks (GDT) represents a subtask (m, i, r), where m is the environment in which the subtask is executed, i is the natural-language instruction, and r is the reward function, which evaluates the state of environment m and outputs a Boolean value indicating whether the subtask is completed. Edges in the GDT represent sequential dependencies between subtasks.
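For intuition, a subtask node could be represented roughly as follows. This is a minimal sketch; the class and field names are assumptions for illustration, not CRAB's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch of a GDT subtask node (m, i, r):
#   m - the environment the subtask runs in
#   i - a natural-language instruction for the agent
#   r - a reward function that inspects the environment state and
#       returns True once the subtask is complete
@dataclass
class SubTaskNode:
    environment: str                  # e.g. "ubuntu" or "android"
    instruction: str                  # e.g. "Create directory /home/crab/assets_copy"
    reward: Callable[[object], bool]  # checks environment state -> completed?

# Edges between such nodes encode the required ordering of subtasks.
```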

CRAB Framework

Cross-environment agent interaction

CRAB introduces the concept of cross-environment tasks: multiple environments (such as a smartphone and a desktop computer) are combined into one environment set, so that agents can coordinate operations across multiple devices to complete complex tasks.



The figure above shows the workflow of a multi-agent system with environmental division of labor in the CRAB framework. The workflow proceeds in a loop: the main agent observes the environments and assigns plans to the sub-agents, and then each sub-agent performs operations in its own environment. A graph evaluator monitors the status of each subtask and continuously updates task completion throughout the workflow. This setup tests the agent's reasoning ability in conditions close to real scenarios, requiring it to handle complex message passing and to have a deep understanding of real-world situations.
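The loop described above can be summarized in Python-style pseudocode. This is an illustrative sketch under assumed interfaces (screenshot(), plan(), decide(), execute(), update()); it is not CRAB's actual implementation.

```python
# Sketch of the environment-division multi-agent loop described above.
def run_episode(main_agent, sub_agents, environments, graph_evaluator, max_steps=30):
    for step in range(max_steps):
        # The main agent looks at all environments and writes a plan
        # (one instruction per environment) for the sub-agents.
        observations = {name: env.screenshot() for name, env in environments.items()}
        plans = main_agent.plan(observations)

        # Each sub-agent acts only in its own environment.
        for name, agent in sub_agents.items():
            action = agent.decide(plans[name], observations[name])
            environments[name].execute(action)

        # The graph evaluator checks every active subtask node and
        # activates successors of newly completed nodes.
        graph_evaluator.update(environments)
        if graph_evaluator.all_completed():
            break
```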

Graph Evaluator

CRAB's built-in graph evaluator combines the advantages of goal-oriented and trajectory-oriented evaluation. It first decomposes a complex task into multiple subtasks, forming a directed acyclic graph. A node-activation mechanism is then defined: a node (subtask) becomes active only after its predecessor tasks are completed, which enforces the ordering of subtasks. Each node is associated with a verification function that checks key intermediate states in the environment. Compared with previous benchmarks, the CRAB graph evaluator introduces a series of new evaluation metrics:

  • Completion Ratio (CR): the ratio of completed subtask nodes to the total number of nodes, CR = C / N.
  • Execution Efficiency (EE): the ratio of the completion ratio to the number of executed actions, EE = CR / A, where A is the number of actions the agent performed.
  • Cost Efficiency (CE): the ratio of the completion ratio to the number of model tokens used, CE = CR / T, where T is the token count.

These metrics give agent benchmarks a more fine-grained, multi-dimensional view of performance.
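As a concrete illustration, the sketch below shows how a graph evaluator of this kind might activate nodes once their predecessors are complete and compute CR, EE, and CE from the counts defined above. The class, attribute, and method names are assumptions; this is not CRAB's actual implementation.

```python
import networkx as nx

class GraphEvaluatorSketch:
    """Illustrative graph evaluator: nodes are subtasks, edges are orderings."""

    def __init__(self, gdt: nx.DiGraph):
        self.gdt = gdt              # each node carries 'environment' and 'reward' attributes
        self.completed = set()

    def active_nodes(self):
        # A node is active when it is not yet completed but
        # all of its predecessors are completed.
        return [
            n for n in self.gdt.nodes
            if n not in self.completed
            and all(p in self.completed for p in self.gdt.predecessors(n))
        ]

    def update(self, environments):
        for n in self.active_nodes():
            reward = self.gdt.nodes[n]["reward"]
            env = environments[self.gdt.nodes[n]["environment"]]
            if reward(env):         # Boolean check of the environment state
                self.completed.add(n)

    def metrics(self, actions_taken: int, tokens_used: int):
        cr = len(self.completed) / self.gdt.number_of_nodes()  # Completion Ratio
        ee = cr / actions_taken if actions_taken else 0.0      # Execution Efficiency
        ce = cr / tokens_used if tokens_used else 0.0          # Cost Efficiency
        return {"CR": cr, "EE": ee, "CE": ce}
```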

CRAB Benchmark-v0

Benchmark build details

Based on the proposed CRAB framework, this paper constructs a concrete benchmark suite, CRAB Benchmark-v0, for the community to conduct further research. CRAB Benchmark-v0 supports both Android phones and Ubuntu Linux desktops, with separate action sets defined for Ubuntu and Android to simulate common real-life interactions. The observation space consists of the two systems' interfaces, and environment state is captured as screenshots. To make GUI operation easier for the agent, the authors use GroundingDINO [7] to locate interactive icons and EasyOCR to detect and annotate interactive text, assigning an ID to each detected item so it can be referenced in the action space.
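The text-detection half of that annotation pipeline could look roughly like the sketch below, which uses EasyOCR to find on-screen text and assigns each detection a sequential ID. This is only an illustration of the idea; the benchmark's exact pipeline, thresholds, and the GroundingDINO icon-detection step are not shown here.

```python
import easyocr

# Detect on-screen text in a screenshot and assign each element an ID,
# so the agent can refer to "element 3" instead of raw pixel coordinates.
# Illustrative sketch only; the confidence threshold is an assumption.
reader = easyocr.Reader(["en"])
detections = reader.readtext("ubuntu_screenshot.png")  # [(bbox, text, confidence), ...]

elements = []
for idx, (bbox, text, confidence) in enumerate(detections):
    if confidence < 0.5:
        continue                       # skip low-confidence detections
    xs = [p[0] for p in bbox]
    ys = [p[1] for p in bbox]
    center = (sum(xs) / 4, sum(ys) / 4)  # click target: center of the bounding box
    elements.append({"id": idx, "text": text, "center": center})

# The agent can then issue actions such as click(element_id=3),
# which the environment resolves back to screen coordinates.
```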

Take a specific task as an example: on an Ubuntu system, create a new directory "/home/crab/assets_copy" and copy all files with the "txt" extension from "/home/crab/assets" to the directory "/home/crab/assets_copy".

This task requires multiple steps to complete. The following figures show the experimental details when GPT-4 Turbo is used as the inference model in a single-agent structure. The agent first uses the search_application command to find the terminal and open it.



It then uses the Linux command "mkdir -p /home/crab/assets_copy" to create the target directory.



After creating the target directory, the agent directly executed the copy command in the terminal:

"cp /home/crab/assets/*.txt/home/crab/assets_copy" to complete the task. The whole process was smooth and there were no mistakes.



Experimental results

The authors then ran baseline experiments on CRAB Benchmark-v0. The core of each agent is a backend multimodal language model, which provides natural-language and image understanding, basic device knowledge, task planning, and logical reasoning; it must support mixed multimodal input and handle multi-turn conversations. The authors therefore selected GPT-4o (gpt-4o-2024-05-13), GPT-4 Turbo (gpt-4-turbo-2024-04-09), Gemini 1.5 Pro (May 2024 version), and Claude 3 Opus (claude-3-opus-20240229) as baseline models.



The experimental results are shown in the table above: GPT-4o and GPT-4 Turbo achieved the highest average test-point completion rates (CR) among the tested models. In terms of execution efficiency (EE) and cost efficiency (CE), the GPT-4 series also outperforms the Gemini and Claude models.


Summary

This paper introduces CRAB, a new cross-environment multi-agent evaluation benchmark. By introducing cross-environment tasks, a graph evaluator, and a subtask-composition method for task construction, the CRAB framework provides a more comprehensive, flexible, and practical platform for evaluating autonomous agents. Compared with previous agent benchmarks, CRAB reduces the manual work required to define task steps and greatly improves the efficiency of benchmark construction. Based on CRAB, the paper proposes CRAB Benchmark-v0, which lets agents perform a variety of complex cross-environment tasks on Ubuntu and Android systems. This work can not only advance evaluation systems for autonomous agents but also inspire the design of more efficient agent systems in the future.

References:

[1] Shuyan Zhou et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. Oct. 24, 2023. URL: http://arxiv.org/abs/2307.13854. preprint.

[2] Chi Zhang et al. AppAgent: Multimodal Agents as Smartphone Users. Dec. 21, 2023. URL: http://arxiv.org/abs/2312.13771. preprint.

[3] Shunyu Yao et al. “Webshop: Towards scalable real-world web interaction with grounded language agents”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 20744–20757.

[4] Tianbao Xie et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. Apr. 11, 2024. URL: http://arxiv.org/abs/2404.07972. preprint.

[5] Lin, Fangru, et al. "Graph-enhanced Large Language Models in Asynchronous Plan Reasoning." arXiv preprint arXiv:2402.02805 (2024).

[6] Tushar Khot et al. “Decomposed Prompting: A Modular Approach for Solving Complex Tasks”. In: The Eleventh International Conference on Learning Representations. 2023. URL: https://openreview.net/forum?id=_nGgzQjzaRy.

[7] Shilong Liu et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv.org. Mar. 9, 2023.