
The "Embodied Smart Town" is here! Robots are running around the supermarket buying groceries, from Shanghai AI Lab

2024-07-22


Mingmin from Aofei Temple
Quantum Bit | Public Account QbitAI

The super realistic robot town is coming!

Here, a robot can shop in a supermarket just like a human:



Buying groceries and cooking at home:



Getting coffee in the office (with human colleagues nearby):



Not only humanoid robots, but also robot dogs and robotic arms can move freely in this "city".



This is GRUtopia (Chinese name: Taoyuan), the first simulated interactive 3D world of its kind, proposed by Shanghai AI Lab.

Here, up to 100k interactive, carefully annotated scenes can be freely combined into realistic urban environments.

These span 89 scene categories, indoor and outdoor, including restaurants, supermarkets, offices, homes, and more.



The world is populated with NPCs driven by large language models, which can communicate and interact with the robots.



In this way, all kinds of robots can run all kinds of behavior simulations in the virtual town. This is the recently popular Sim2Real route, and it can greatly reduce the difficulty and cost of collecting real-world data for embodied intelligence.

The project is planned to be open source, and a demo installation guide is currently available on GitHub.

After a successful installation, you can control a humanoid robot in the demo, moving it around the room and switching between different viewing angles.
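To give a feel for what such a demo involves, here is a minimal, hypothetical sketch of a keyboard-driven control loop. The environment class, scene path, and key bindings below are illustrative placeholders, not GRUtopia's actual API:

```python
# Hypothetical sketch of a demo-style control loop; the Env class and
# method names are illustrative placeholders, not GRUtopia's real interface.
import numpy as np

class DemoEnv:
    """Minimal stand-in for a simulator environment."""
    def __init__(self, scene_file: str):
        self.scene_file = scene_file
        self.robot_pos = np.zeros(3)

    def step(self, velocity_cmd: np.ndarray) -> np.ndarray:
        # Integrate a simple velocity command; a real simulator would
        # run physics, rendering, and sensor simulation here.
        self.robot_pos += velocity_cmd * 0.02  # 50 Hz control step
        return self.robot_pos

KEY_TO_CMD = {
    "w": np.array([0.5, 0.0, 0.0]),   # forward
    "s": np.array([-0.5, 0.0, 0.0]),  # backward
    "a": np.array([0.0, 0.5, 0.0]),   # strafe left
    "d": np.array([0.0, -0.5, 0.0]),  # strafe right
}

env = DemoEnv("scenes/demo_house.usd")  # placeholder scene path
for key in "wwwwaad":                   # scripted key presses for the example
    pos = env.step(KEY_TO_CMD[key])
print("final robot position:", pos)
```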



A robot's virtual paradise

GRUtopia has three core components:

  • GRScenes
  • GRResidents
  • GRBench

Among them, GRScenes is a dataset containing large-scale scene data.

It greatly expands the range of environments in which robots can move and operate. Previous work has focused more on home scenarios.

The researchers state that their goal is to extend general-purpose robots' capabilities to a variety of service scenarios, such as supermarkets and hospitals, while covering both indoor and outdoor environments, including amusement parks, museums, and exhibition halls.

For each scene, they performed detailed, high-quality modeling. The 100 scenes contain 2,956 interactive objects and 22,001 non-interactive objects across 96 categories.



GRResidents is an NPC system.

Driven by a large language model with a deep understanding of the scene information in the simulated environment, the NPCs can infer spatial relationships between objects and take part in dynamic dialogue and task assignment.

With the help of this system, GRUtopia can generate a large number of scene-grounded tasks for robots to complete.
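As a rough illustration of how such a system could work, the sketch below serializes a toy scene annotation into an LLM prompt and stubs out the model call. The scene format, relation heuristic, and prompt wording are assumptions for illustration, not GRUtopia's actual design:

```python
# Hedged sketch of an LLM-driven NPC. The scene-graph format, prompt, and
# call_llm stub are illustrative assumptions, not GRUtopia's actual design.
import json

SCENE = {  # toy scene annotation: object id -> category and position (m)
    "cup_01":   {"category": "cup",   "pos": [1.2, 0.4, 0.9]},
    "table_03": {"category": "table", "pos": [1.0, 0.5, 0.0]},
    "sofa_02":  {"category": "sofa",  "pos": [4.1, 2.0, 0.0]},
}

def spatial_relation(a: str, b: str) -> str:
    """Infer a coarse relation from positions, the kind of symbolic
    pre-processing an NPC system might feed to the LLM."""
    pa, pb = SCENE[a]["pos"], SCENE[b]["pos"]
    if abs(pa[0] - pb[0]) < 0.5 and abs(pa[1] - pb[1]) < 0.5 and pa[2] > pb[2]:
        return f"{a} is on top of {b}"
    return f"{a} is away from {b}"

def call_llm(prompt: str) -> str:
    # Stub: a real system would query a model backend such as GPT-4o here.
    return "Please bring me the cup that is on the table."

facts = [spatial_relation("cup_01", "table_03")]
prompt = (
    "You are an NPC in a simulated home. Scene objects:\n"
    + json.dumps(SCENE, indent=2)
    + "\nKnown relations: " + "; ".join(facts)
    + "\nAssign the robot one concrete fetch task."
)
print(call_llm(prompt))
```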



Cross-validation with humans shows that the NPC system describes and locates objects with good accuracy.

In the description experiment, the NPC system randomly selects an object and describes it; the trial counts as a success if a human can find the corresponding object.

The localization experiment is the reverse: the trial counts as a success if the NPC system can find the corresponding object from a description given by a human.
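The bookkeeping behind the two protocols is simple. The following sketch computes success rates over fabricated placeholder trials, purely to make the setup concrete:

```python
# Rough sketch of the two cross-validation protocols described above.
# The trial data are fabricated placeholders to show the bookkeeping only.
description_trials = [  # NPC describes an object; did the human find it?
    {"target": "cup_01", "human_pick": "cup_01"},
    {"target": "sofa_02", "human_pick": "sofa_02"},
    {"target": "table_03", "human_pick": "cup_01"},
]
localization_trials = [  # human describes; did the NPC find the object?
    {"target": "cup_01", "npc_pick": "cup_01"},
    {"target": "sofa_02", "npc_pick": "table_03"},
]

def success_rate(trials, pick_key):
    hits = sum(t[pick_key] == t["target"] for t in trials)
    return hits / len(trials)

print("description accuracy:", success_rate(description_trials, "human_pick"))
print("localization accuracy:", success_rate(localization_trials, "npc_pick"))
```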



Success rates vary depending on which large model is called; overall, GPT-4o performs best.



GRBench is a benchmark for evaluating embodied intelligence performance.

It consists of three benchmarks, Object Loco-Navigation, Social Loco-Navigation, and Loco-Manipulation, whose difficulty increases in that order.



To analyze the performance of the NPCs and the control APIs, the study proposes LLM- and VLM-based baseline agents to validate the soundness of the benchmark design.
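Conceptually, a baseline agent of this kind runs a perception-prompt-action loop over each episode. Here is a minimal sketch under assumed interfaces; the environment, observation fields, and discrete action set are illustrative, not the actual GRBench API:

```python
# Hypothetical sketch of running a large-model baseline agent on one
# benchmark episode. The environment, observation fields, and discrete
# action set are illustrative assumptions, not the actual GRBench API.
class ToyNavEnv:
    """Stand-in for a navigation episode: the robot succeeds if it
    stops after closing the remaining distance to the target."""
    def __init__(self):
        self.distance = 5  # forward steps left to reach the target

    def reset(self) -> dict:
        self.distance = 5
        return {"distance_hint": self.distance, "dialogue": []}

    def step(self, action: str) -> dict:
        # Assumed action set: move_forward / turn_left / turn_right /
        # ask_npc / stop (ask_npc would query the LLM-driven NPC).
        if action == "move_forward":
            self.distance = max(0, self.distance - 1)
        return {"distance_hint": self.distance, "dialogue": []}

def model_policy(obs: dict) -> str:
    # Stub for the LLM/VLM backend: a real agent would render the
    # observation (egocentric image + dialogue history) into a prompt
    # and parse the model's chosen action from its reply.
    return "move_forward" if obs["distance_hint"] > 0 else "stop"

env = ToyNavEnv()
obs = env.reset()
for _ in range(50):  # step budget per episode
    action = model_policy(obs)
    if action == "stop":
        break
    obs = env.step(action)
print("episode success:", obs["distance_hint"] == 0)
```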



Experimental results show that agents using large models as the backend outperform random strategies on all benchmarks, and Qwen-VL even outperforms GPT-4o in conversation.



Finally, an overall comparison with other platforms shows that GRUtopia is more capable in every dimension.



The research was led by OpenRobotLab, a team at Shanghai AI Laboratory.

The lab focuses on research into embodied artificial general intelligence and is committed to building a general robot algorithm system that integrates software and hardware, the virtual and the real.

In May this year, the team also released Grounded 3D-LLM, an embodied multimodal large model that can automatically generate scene descriptions and embodied dialogue data from the object level to the local-region level, effectively alleviating current limitations in 3D scene understanding.



Paper address:
https://arxiv.org/abs/2407.10943

GitHub address:
https://github.com/openrobotlab/grutopia?tab=readme-ov-file