news

Nanyang Technological University creates task datasets and test benchmarks to improve the task completion capabilities of web agents

2024-07-18


Recently, using large models such as GPT-4V and Gemini Pro, Nanyang Technological University intern Zhang Ziniu and his team found that current web agents are still quite limited, especially on tasks that mix multiple sub-tasks.

To improve agents' ability to operate on web pages, the research team created a task dataset and ran benchmark tests on it.

On this dataset, an agent must process multimodal web-page information and complete tasks by operating across different web pages, bringing it closer to how people actually use the web.

The team also found that agents have serious memory deficits, which severely hurt accuracy on multi-hop problems. In response, they proposed a memory module to address this.

Overall, this work improves agents' task-completion capabilities and provides a benchmark for subsequent research.

According to reports, this achievement is one of a series of works. Initially, Zhang Ziniu, Tian Shulin, Chen Liangyu, and others reproduced WebArena, the single-hop, single-modality benchmark created by a Carnegie Mellon University team.

Later, by carefully analyzing WebArena's tasks and how agents completed them, they found there was still much worth exploring.

For example: why are the tasks not realistic enough? Why do the agents fall short?

After reading other papers on web agents, the team considered expanding the tasks from unimodal to multimodal.

In practice, processing a web page involves more than reading its text. The team therefore tried to extract image information from image-rich websites, such as the official sites of art galleries.

However, because many sites employ protection measures, image information could not be extracted from their HTML files.

They therefore turned to shopping sites and Wikipedia for images and built a set of multimodal tasks for web agents.

The team then expanded the tasks to multi-hop tasks, choosing travel planning as an example domain, and tested agents on the dataset.

They also tried several ways of processing visual information: for example, passing the image directly to the agent as part of its prompt, or first sending the image to a large multimodal model and then merging that model's output into the agent's input.
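The two strategies can be sketched as follows. This is a hypothetical illustration, not the team's actual pipeline: `Observation`, `caption_image`, and both `build_prompt_*` helpers are invented names, and `caption_image` is a stub standing in for a real multimodal model such as GPT-4V.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    text: str        # textual content of the page (e.g. accessibility tree)
    image_url: str   # URL of an image found on the page


def caption_image(image_url: str) -> str:
    """Stub for a multimodal captioner; a real agent would call a VLM here."""
    return f"[caption of {image_url}]"


def build_prompt_direct(obs: Observation) -> dict:
    # Strategy 1: hand the raw image to a multimodal agent alongside the text.
    return {"text": obs.text, "image": obs.image_url}


def build_prompt_captioned(obs: Observation) -> dict:
    # Strategy 2: caption the image first, then merge the caption into the
    # text prompt so even a text-only agent can use the visual information.
    caption = caption_image(obs.image_url)
    return {"text": obs.text + "\nImage description: " + caption}
```

The trade-off is that Strategy 1 preserves all visual detail but requires a multimodal agent, while Strategy 2 works with any text agent at the cost of whatever the captioner drops.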

During this process, they found that evaluating only the whole task was unsuitable for multi-hop tasks, so they proposed a new evaluation method for them.
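The difference can be illustrated with a minimal sketch of hop-level scoring; this is an assumption about the general idea, not the paper's exact metric. Whole-task evaluation gives zero credit unless every hop succeeds, while hop-level evaluation exposes partial progress.

```python
def whole_task_success(hop_results: list) -> float:
    # Conventional scoring: the task counts only if every hop succeeded.
    return 1.0 if hop_results and all(hop_results) else 0.0


def hop_level_score(hop_results: list) -> float:
    # Hop-wise scoring: credit each completed sub-task individually,
    # so an agent that finishes 2 of 3 hops scores 2/3 instead of 0.
    if not hop_results:
        return 0.0
    return sum(hop_results) / len(hop_results)
```

For an agent that completes two of three hops, `whole_task_success` returns 0.0 while `hop_level_score` returns 2/3, making incremental capability differences visible.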

Analyzing the agents' experimental results, they found that agent memory was very poor, so they proposed a memory-enhancement module to improve it and ran an ablation study on the module.
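A memory module of this kind can be sketched as a buffer that caches the outcome of each completed hop and injects that history into the next hop's prompt. The class below is a hypothetical sketch under that assumption; `HopMemory` and its methods are invented names, not the paper's implementation.

```python
class HopMemory:
    """Minimal sketch of a memory buffer for a multi-hop web agent:
    record what each finished hop produced, then prepend that history
    to the prompt for the next hop so earlier results are not forgotten."""

    def __init__(self) -> None:
        self.entries: list = []

    def record(self, hop_index: int, result: str) -> None:
        # Store a short summary of a completed hop's outcome.
        self.entries.append(f"Hop {hop_index}: {result}")

    def augment_prompt(self, subtask: str) -> str:
        # Inject the accumulated history ahead of the current sub-task.
        if not self.entries:
            return subtask
        history = "\n".join(self.entries)
        return f"Previously completed:\n{history}\n\nCurrent sub-task: {subtask}"
```

The idea is simply that without such a buffer, each hop's prompt contains only the current page, so results gathered on earlier pages are lost.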

Recently, a related paper was published on arXiv with the title "MMInA: Benchmarking Multihop Multimodal Internet Agents".


Figure | Related papers (Source: arXiv)

The team is also following the latest progress on web agents. In the future, they may provide full-page screenshots as input to the agent.