2024-08-14
- Cressey from Aofei Temple
Quantum Bit | Public Account QbitAI
The Apple team has released a new open-source project: a benchmark for evaluating large models' tool-calling capabilities.
The benchmark adopts a novel scenario-based evaluation method, which better reflects how a model performs in real-world environments.
It also introduces important scenarios, such as dialogue interaction and state dependency, that traditional benchmarks do not address.
The benchmark is called ToolSandbox, and Ruoming Pang, head of Apple's foundation model team, took part in the research.
ToolSandbox makes up for the lack of scenario-based evaluation in existing test standards and narrows the gap between test conditions and actual applications.
For interaction, the authors let GPT-4o play the role of a user and converse with the model under test, thereby simulating real-world dialogue.
Specifically, GPT-4o is told that it is no longer an assistant but is now playing the role of user A, who is talking to user B, and is then given a series of specific requests to make.
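Conceptually, the user-simulator setup might look like the sketch below. The prompt wording and the simulate_user_turn helper are illustrative assumptions rather than the paper's actual code; GPT-4o is accessed here through the standard OpenAI chat API.

```python
# Illustrative sketch of a GPT-4o user simulator (prompt wording is hypothetical,
# not copied from the ToolSandbox paper).
from openai import OpenAI

client = OpenAI()

USER_SIMULATOR_PROMPT = (
    "You are no longer an assistant. You are now role-playing user A, "
    "who is chatting with user B. Stay in character, make your requests "
    "one step at a time, and call end_conversation when your task is done "
    "or clearly cannot be completed."
)

def simulate_user_turn(history: list[dict]) -> str:
    """Ask GPT-4o (acting as the user) for its next message, given the dialogue so far."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": USER_SIMULATOR_PROMPT}, *history],
    )
    return response.choices[0].message.content
```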
The authors also used ToolSandbox to test several mainstream models. Overall, closed-source models score higher than open-source models, with GPT-4o the strongest of all.
iOS app developer Nick Dobos said Apple's standards are simple and clear.
At the same time, he pointed out that ChatGPT is already somewhat stretched in the face of three tools. If Siri wants to manage dozens or hundreds of applications on the phone, it also needs to improve its tool calling capabilities.
In other words, the ToolSandbox research may be intended to guide Siri's future development.
As mentioned above, ToolSandbox adopts a scenario-based and interactive testing method.
Specifically, ToolSandbox includes nearly 2,000 scenarios across seven categories: single/multi-tool calls, single/multi-turn conversations, state dependency, canonicalization, and insufficient information.
The first few categories are self-explanatory; the last three, state dependency, canonicalization, and insufficient information, need a little more explanation.
Across these scenarios, ToolSandbox evaluates each model on three metrics, which are detailed in the evaluation stage below.
For tools, the authors selected 34 composable Python functions, whose complexity is comparable to that of real-world scenarios.
It includes both native Python tools and some integrated RapidAPI tools, covering functions in many common areas such as search, conversation, navigation, weather, image processing, etc.
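As a rough illustration of what "composable, stateful tools" means here, the functions below are hypothetical stand-ins (tool names, state fields, and return values are invented for this sketch, not taken from the benchmark): each tool reads or mutates a shared world state, and one tool's output can feed another's arguments.

```python
# Hypothetical examples of composable, stateful tools; the real benchmark ships
# 34 such Python functions plus RapidAPI integrations.
world_state = {"cellular_enabled": False, "location": None}

def enable_cellular_service() -> bool:
    """State-mutating tool: turn on cellular so network-dependent tools can run."""
    world_state["cellular_enabled"] = True
    return True

def get_current_location() -> str:
    """State-dependent tool: fails unless cellular service was enabled first."""
    if not world_state["cellular_enabled"]:
        raise RuntimeError("Cellular service is off; enable it before locating the device.")
    world_state["location"] = "Cupertino, CA"    # placeholder value
    return world_state["location"]

def search_weather(city: str) -> str:
    """Composable tool: its argument typically comes from get_current_location()."""
    return f"Sunny, 24°C in {city}"              # stubbed response
```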
In terms of process, the first step is preparing the test scenario: the researchers define and store the initial world state and use a calibrated GPT-4o model to generate the initial user message.
Then it enters the interactive execution phase. The system first initializes the Message Bus as the communication channel between roles, and configures the model that plays the role of the user and the model under test.
After the conversation loop begins, the model simulating the user sends an initial message, and the model under test receives this message and decides on the next action - directly replying to the user, or calling a tool to interact with the environment.
If the model chooses to call a tool, it provides the necessary parameters in JSON format, and the execution environment then interprets and executes the call, possibly updating the world state and handling potential parallel call conditions.
After the execution result is returned to the model under test, the model under test again decides the next action. This process continues until the user simulator considers the task completed (or cannot be completed), at which point it calls the end_conversation tool to end the conversation.
During the entire interaction process, the system records all messages and state changes to form a complete "conversation track", which then enters the evaluation stage.
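Put together, the execution phase can be pictured roughly as the loop below. This is a simplified sketch: the message bus, the agent and user-simulator interfaces, and the JSON tool-call shape are placeholders for illustration, not the benchmark's real API.

```python
import json

def run_scenario(user_sim, agent, tools, world_state):
    """Simplified sketch of a ToolSandbox-style interactive execution loop
    (all interfaces here are placeholders, not the benchmark's real API)."""
    message_bus = [{"role": "user", "content": user_sim.first_message()}]
    trajectory = list(message_bus)                    # record of every message and state change

    while True:
        action = agent.step(message_bus)              # model under test picks its next move

        if action["type"] == "tool_call":
            # e.g. {"name": "search_weather", "arguments": '{"city": "Cupertino"}'}
            args = json.loads(action["arguments"])
            result = tools[action["name"]](**args)    # may mutate world_state as a side effect
            message_bus.append({"role": "tool", "name": action["name"], "content": str(result)})
            trajectory.append(("tool_call", action, dict(world_state)))
            continue                                  # agent sees the result and decides again

        # plain text reply: hand it to the user simulator for the next turn
        message_bus.append({"role": "assistant", "content": action["content"]})
        trajectory.append(("assistant", action["content"], dict(world_state)))

        user_reply = user_sim.reply(message_bus)
        if user_reply == "end_conversation":          # simulator judges the task done or impossible
            return trajectory
        message_bus.append({"role": "user", "content": user_reply})
        trajectory.append(("user", user_reply, dict(world_state)))
```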
Evaluation uses predefined “milestones” and “minefields” to measure the performance of the agent model.
Milestones define the key events required to complete a task; they form a directed acyclic graph that captures their temporal dependencies.
The system seeks the best match between events and milestones in the trajectory while maintaining the topological order of the milestones.
Minefields define prohibited events and are mainly used to detect whether the model hallucinates when information is insufficient.
For example, the figure below shows an example of a minefield assessment in the "insufficient information" scenario.
In this task, since the current timestamp is not available, the model should not call the timestamp_diff tool, but the model incorrectly guessed the current timestamp and called the tool, resulting in a score of 0 for this round.
Ultimately, the system calculates a composite score that is the product of the average milestone match score and the minefield penalty.
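Under that description, the final score could be computed roughly as in the sketch below (the similarity function and the exact penalty rule are simplified assumptions): milestone similarities are averaged and multiplied by a minefield penalty that drops to zero when a prohibited call, like the timestamp_diff example above, appears in the trajectory.

```python
def composite_score(milestone_similarities: list[float], minefields_hit: int) -> float:
    """Sketch of the composite metric: average milestone match x minefield penalty.

    milestone_similarities: best-match similarity (0..1) for each milestone,
        found while preserving the milestones' topological order in the DAG.
    minefields_hit: number of prohibited events (e.g. a hallucinated
        timestamp_diff call) observed in the trajectory.
    """
    if not milestone_similarities:
        return 0.0
    milestone_score = sum(milestone_similarities) / len(milestone_similarities)
    minefield_penalty = 0.0 if minefields_hit > 0 else 1.0   # any violation zeroes the round
    return milestone_score * minefield_penalty


# Example: three milestones matched well and no minefield was triggered -> 0.9;
# the same trajectory with one minefield hit scores 0, as in the figure above.
print(composite_score([1.0, 0.8, 0.9], minefields_hit=0))    # 0.9
print(composite_score([1.0, 0.8, 0.9], minefields_hit=1))    # 0.0
```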
In addition, the system also counts the average number of rounds required to complete the task as a supplementary indicator for evaluating model efficiency.
Overall, closed-source models outperform open-source models at tool calling.
GPT-4o has the highest average score at 73.0, the only model to exceed 70, and it achieved the top score in four of the seven scenario categories.
GPT-4o is also extremely robust: the authors perturbed the tools in eight different ways, and GPT-4o posted the highest robustness score under every one.
Close behind is Claude-3-Opus with an average score of 69.2, which even outperforms GPT-4o in the insufficient-information scenario, followed by other GPT and Claude variants.
Google's Gemini lags behind: Gemini 1.5 Pro scores 60.4, barely a passing grade and lower than GPT-3.5, though it does perform well on the insufficient-information category.
The best average score among the open-source models was only 31.4; the well-known Mistral-7B scored 29.8, though it achieved the top score of 76.8 on the insufficient-information category.
Some open-source models, such as Gorilla and Command-R, cannot handle tool responses at all, or can barely complete a single round of tool calls.
Further analysis showed that open-source models are poor at recognizing when to call a tool and prefer to treat the problem as a pure text-generation task.
Broken down by task type, large models perform well on single/multi-tool calls and single-turn user requests, but their advantage weakens on multi-turn dialogue and state-dependent tasks.
Within the GPT, Claude, and Gemini families, larger models show a clearer advantage on multi-tool call and multi-turn dialogue tasks; on state-dependent tasks, however, small and medium models (e.g., GPT-3.5, Claude-3-Sonnet) actually outperform the larger ones (GPT-4, Claude-3-Opus).
In addition, canonicalization is a major challenge for all models, especially in scenarios where canonicalization itself requires tool use; canonicalizing time-related parameters is also particularly difficult.
The study on robustness showed that the sensitivity of the model to changes in tool description, parameter information, etc. varied greatly, and no obvious patterns were found.
In terms of efficiency, stronger models are usually more efficient, but there are exceptions. For example, the efficiency of the Claude series models is generally better than GPT.
In short, large models still face many challenges in terms of tool use and dealing with complex interactive scenarios in the real world.
The ToolSandbox team members come from multiple teams at Apple, including machine learning, data science, and foundation models.
The first author is Jiarui Lu, a Chinese machine learning engineer who earned his bachelor's degree at Tsinghua University, where he worked as a research assistant in Professor Zhu Jun's laboratory.
Lu then obtained a master's degree in machine learning from Carnegie Mellon University and joined Apple in 2020 after graduation.
Including Lu, 10 of the 12 listed authors are Chinese, all with backgrounds at well-known universities.
They include Ruoming Pang, head of the foundation model team.
Bernhard Aumayer, an engineering director who has been at Apple for eight years, also took part in the project.
Paper address:
https://arxiv.org/abs/2408.04682