news

The "strongest agent" Agent Q is released! Llama 3's success rate soars 340%, and OpenAI's mysterious "Strawberry" gets scooped

2024-08-14



New Intelligence Report

Editors: Qiao Yang, Hao Kun

[New Intelligence Introduction] Startup MultiOn recently released Agent Q, billed as the "strongest agent currently available," which achieves a 95.4% success rate on real booking tasks. Some netizens speculate that OpenAI's mysterious Q* project is behind it.

Before OpenAI's Q*/Strawberry project was released, a startup called MultiOn released an agent called Q.


We are very happy to announce that the result of our work over the past six months, Agent Q, is now live! It is a self-supervised agent framework that can reason and search, performing self-play and reinforcement learning on real tasks on the internet while self-correcting and improving autonomously!

What has attracted even more attention is that MultiOn co-founder and CEO Div Garg keeps playing up that conspicuous name whenever he mentions Agent Q on Twitter.


This drew continuous public attention, and some speculated that the backer behind Agent Q is OpenAI's Q* project.

On top of that, MultiOn opened a standalone Twitter account for Agent Q, which frequently posts all sorts of odd remarks that are "hard to tell apart from a human's."

The account's background picture and bio are full of strawberries, and it even reuses the photo of strawberries from Altman's own garden that he had posted earlier.



What is striking is that the followers of this mysterious account include many big names and KOLs, among them Y Combinator CEO Garry Tan, Quora CEO Adam D'Angelo, New York Times columnist Kevin Roose, Wharton School AI professor Ethan Mollick, and several OpenAI employees.

Even Altman himself has recently begun actively interacting with this mysterious account, commenting on one of its posts that joked about "AGI reaching level two."


Whether MultiOn's move is pure hype or a pre-promotion for OpenAI's Q* is a matter of opinion.


Either this will be one of the best AI agents released to date, or Div Garg will wreck the company's reputation by tying it to the worst hype of all time, and it will backfire in the AI community.

Putting aside all the controversies, let's first take a look at how much technical content this Agent Q has.

According to CEO Div Garg, Agent Q has not only planning and reasoning capabilities but also the ability to self-heal. With just one day of training, the team improved Llama 3's zero-shot performance by 340%, reaching a 95.4% success rate on real-world booking tasks.


This is a major step forward for autonomous AI agents to make complex and reliable decisions in real-world environments.

In the official demo video, Agent Q can perform tasks including booking restaurants, meetings, and air tickets, all of which involve multi-step planning, reasoning, decision-making, and interaction with various applications.

Although MultiOn's research team has posted the paper on its official website, Agent Q is not yet open for trial; users must join a waiting list to apply for beta access.


Paper address: https://multion-research.s3.us-east-2.amazonaws.com/AgentQ.pdf

The official website states that Agent Q will be open to MultiOn developers and users later this year.

Technical Interpretation

Although LLMs have revolutionized the field of NLP in recent years and achieved remarkable results, they still face major challenges in interactive environments, especially on multi-step reasoning tasks such as web navigation.

Current training methods that rely on static language datasets are insufficient to adapt these models to dynamic real-world interactions.

The emergence of Agent Q is a major milestone in the field of AI agents, which combines search, self-reflection and reinforcement learning to enable planning and self-repair.

By introducing a new learning and reasoning framework, Agent Q addresses the limitations of previous LLM training techniques, enabling it to achieve autonomous web navigation.


Agent Q's steps in executing a booking task

Problems with Current Approaches

Current approaches, such as supervised fine-tuning on curated expert demonstrations, typically perform poorly on multi-step agentic tasks because of compounding errors and limited exploration data, producing suboptimal policies when complex decisions and adaptive learning are required in dynamic environments.

Agent Q's Method and Components

Agent Q combines guided Monte Carlo Tree Search (MCTS) with AI self-critique and iterative fine-tuning, while using RLHF-style algorithms such as Direct Preference Optimization (DPO), so that the LLM agent can learn from both successful and failed trajectories and generalize better on multi-step reasoning tasks.
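For readers unfamiliar with DPO, the snippet below is a minimal PyTorch sketch of the standard DPO objective applied to trajectory-level preference pairs; the variable names and the way trajectories are scored are illustrative assumptions, not MultiOn's implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities that the trainable
    policy (or the frozen reference model) assigns to the preferred
    ("chosen") or dispreferred ("rejected") trajectory of a pair.
    """
    # Implicit rewards: log-ratio of the policy to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Maximize the margin between preferred and dispreferred trajectories.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```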

The key components of Agent Q include the following (a simplified sketch of how they fit together appears after the list):

1. MCTS-based guided search: the agent autonomously generates data by exploring different actions and web pages, striking a balance between exploration and exploitation.

During search, MCTS uses higher sampling temperatures and diverse prompts to widen the action space, ensuring that diverse and high-quality trajectories can be collected.

2. AI self-critique: at each step, AI-based self-critique provides valuable feedback that refines the agent's decision-making. This step-level feedback is critical for long-horizon tasks, where sparse signals make learning difficult.


3. Direct Preference Optimization: the DPO algorithm fine-tunes the model on preference pairs constructed from the data generated by MCTS. This off-policy training method lets the model learn effectively from the aggregated dataset, including suboptimal branches explored during search, thereby improving the success rate in complex environments.
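To make the interplay of the three components more concrete, here is a simplified, hypothetical sketch of the data-collection loop: candidate actions are sampled at a higher temperature, an AI critic scores each step, a UCB-style rule balances exploration and exploitation, and the best and worst candidates form preference pairs for DPO. The environment, policy, and critic interfaces (env.reset, policy.sample_action, critic.score, and so on) are placeholders, not MultiOn's actual code.

```python
import math

def collect_preference_data(env, policy, critic, n_episodes=10,
                            n_candidates=4, temperature=1.2, c_explore=1.0):
    """Hypothetical loop combining guided search, AI self-critique,
    and preference-pair construction for later DPO training."""
    preference_pairs = []

    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            # 1. Guided search: sample several candidate actions at a higher
            #    temperature so that diverse trajectories are explored.
            candidates = [policy.sample_action(state, temperature=temperature)
                          for _ in range(n_candidates)]

            # 2. AI self-critique: the critic scores each candidate, giving
            #    dense step-level feedback instead of only a sparse end reward.
            scores = [critic.score(state, a) for a in candidates]

            # UCB-style selection balancing the critic's value estimate
            # (exploitation) against how rarely an action has been tried.
            visits = [policy.visit_count(state, a) + 1 for a in candidates]
            total_visits = sum(visits)
            ucb = [s + c_explore * math.sqrt(math.log(total_visits) / v)
                   for s, v in zip(scores, visits)]
            best = max(range(n_candidates), key=lambda i: ucb[i])
            worst = min(range(n_candidates), key=lambda i: scores[i])

            # 3. Preference pairs for DPO: the highest-scoring candidate is
            #    "chosen", the lowest-scoring one is "rejected".
            if best != worst:
                preference_pairs.append(
                    (state, candidates[best], candidates[worst]))

            state, done = env.step(candidates[best])

    return preference_pairs
```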

Evaluation Experiments

In the WebShop task, which simulates an online store, the agent (built on the xLAM-v0.1-r model) needs to search for a specific product.

Although methods such as RFT, DPO, and beam search also bring some improvement, the gains are not as large as Agent Q's.

When Agent Q is combined with MCTS, the task success rate rises from 28.6% to 50.5%, on par with the average human level of 50%.


In the real OpenTable reservation task, the agent must carry out multiple steps: find the right restaurant page, select an appropriate date and time, choose seating according to the user's preferences, submit the user's contact information, and finally complete the booking.

This is significantly more complex than WebShop: according to post-experiment statistics, a WebShop task takes 6.8 steps on average to complete, while OpenTable roughly doubles that figure to 13.9.

Since OpenTable is a real online environment rather than a simulated dataset, automated evaluation is difficult. The paper therefore uses GPT-4V as an evaluator, assigning a reward to each of the agent's actions against pre-defined criteria and marking whether the task has been completed.
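As an illustration of this kind of evaluation setup, the sketch below asks a vision-capable GPT-4 model to judge one agent step from a page screenshot. The prompt, scoring rubric, and model name are assumptions for demonstration, not the paper's actual evaluation code.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_step(screenshot_path: str, instruction: str, action: str) -> str:
    """Ask a vision-capable GPT-4 model to rate a single agent step.

    The rubric below is illustrative; the paper's pre-defined evaluation
    criteria are not reproduced here.
    """
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"Task: {instruction}\n"
                    f"Agent action taken on this page: {action}\n"
                    "Rate how much this step advances the booking task "
                    "from 0 (harmful) to 1 (clearly correct), then state "
                    "whether the overall task is now complete (yes/no)."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```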


After only one day of autonomous data collection, Agent Q improved Llama 3's zero-shot success rate from 18.6% to 81.7%, a 340% relative improvement.

After adding online Monte Carlo tree search, the success rate can be further improved to 95.4%.


Although Agent Q has demonstrated strong web navigation, search, reasoning, and planning capabilities in the above evaluation experiments, there is still much room for discussion and improvement in the current methods used:

- Design of the reasoning algorithm: a core challenge for Agent Q is the limited reasoning ability of the underlying model, which constrains its exploration and search strategies. In addition, the critic model is frozen while the agent policy is trained; fine-tuning the critic as well may yield performance gains.

- Choice of search algorithm: given MCTS's earlier success on math and coding tasks, Agent Q favors MCTS for search, but in real environments this can lead to a considerable number of risky interactions. Switching to a different search strategy may be a more appropriate choice.

- Online safety and interaction: Agent Q currently allows a large degree of autonomous exploration and self-assessment with limited human intervention, yet the agent can still make many mistakes, especially in critical tasks such as email, payments, and archiving.

If these safety issues are not resolved, the scenarios in which Agent Q can actually be deployed will be greatly limited; additional safety-critic models and human-in-the-loop training setups may be needed in the future.

References:

https://x.com/rm_rafailov/status/1823462897751875701

https://x.com/ai_for_success/status/1823447309008490730

https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities