
A world first: Pengcheng Laboratory surveys nearly 400 papers on embodied intelligence

2024-07-26


AIxiv is a column where Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

Embodied intelligence is considered a necessary path to achieving general artificial intelligence. At its core, intelligent agents complete complex tasks by interacting with both the digital space and the physical world. In recent years, multimodal large models and robotics have made great progress, and embodied intelligence has become a new focus of global scientific, technological, and industrial competition. However, a comprehensive review of the current state of embodied intelligence has been lacking. Researchers from the Institute of Multi-Agent and Embodied Intelligence at Pengcheng Laboratory and the HCP Laboratory at Sun Yat-sen University have therefore conducted a comprehensive analysis of the latest progress in embodied intelligence and released the first survey of embodied intelligence in the era of multimodal large models.

The review surveys nearly 400 papers and analyzes embodied intelligence research along multiple dimensions. It first examines embodied robots and embodied simulation platforms, analyzing their research focus and limitations in depth. It then thoroughly examines four main research areas: (1) embodied perception, (2) embodied interaction, (3) embodied agents, and (4) sim-to-real transfer, covering state-of-the-art methods, basic paradigms, and comprehensive datasets. In addition, the review explores the challenges embodied agents face in the digital space and the physical world, emphasizing the importance of their active interaction with dynamic digital and physical environments. Finally, it summarizes the challenges and limitations of embodied intelligence and discusses potential future directions. The authors hope the review provides a foundational reference for embodied intelligence research and promotes related technological innovation. They have also released a list of embodied intelligence papers on GitHub; the paper list and code repositories will be continuously updated.



Paper address: https://arxiv.org/pdf/2407.06886

Embodied Intelligence Paper List: https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List

1. The past and present of embodied intelligence

The concept of embodied intelligence was first proposed by Alan Turing in the embodied Turing test of 1950, which asks whether an agent can exhibit intelligence that is not limited to solving abstract problems in a virtual environment (the digital space) but can also cope with the complexity and unpredictability of the physical world. (Agents are the basis of embodied intelligence: they exist in the digital space and the physical world and are embodied in various forms, including not only robots but also other devices.) The development of embodied intelligence is therefore seen as a fundamental path to achieving general artificial intelligence, making it especially important to explore its complexity, evaluate its current state of development, and consider its future trajectory. Today, embodied intelligence spans several key technologies, including computer vision, natural language processing, and robotics, of which the most representative areas are embodied perception, embodied interaction, embodied agents, and sim-to-real transfer. In embodied tasks, an agent must fully understand the human intention behind a language instruction, actively explore its surroundings, comprehensively perceive the multimodal elements of virtual and physical environments, and perform appropriate actions to complete complex tasks. Compared with traditional deep reinforcement learning methods, rapidly advancing multimodal models demonstrate greater versatility, flexibility, and generalization in complex environments. Visual representations pretrained with state-of-the-art visual encoders provide accurate estimates of object categories, poses, and geometry, enabling embodied models to thoroughly perceive complex and dynamic environments.
Powerful large language models enable robots to better understand human language instructions and provide a feasible way to align visual and language representations in embodied robots. World models show remarkable simulation capability and a good grasp of physical laws, enabling embodied models to comprehensively understand physical, real-world environments. These advances allow embodied agents to fully perceive complex environments, interact naturally with humans, and perform tasks reliably. The figure below shows a typical architecture of an embodied agent.



Embodied Agent Framework

In this review, we provide a comprehensive overview of current progress in embodied intelligence, including: (1) embodied robots, the hardware solutions of embodied intelligence in the physical world; (2) embodied simulation platforms, digital spaces for training embodied agents efficiently and safely; (3) embodied perception, actively perceiving 3D space and integrating multiple sensory modalities; (4) embodied interaction, interacting with, and even changing, the environment effectively and reasonably to complete specified tasks; (5) embodied agents, using multimodal large models to understand abstract instructions and decompose them into a series of subtasks to be completed step by step; (6) sim-to-real transfer, generalizing skills learned in the digital space to the physical world. The figure below shows the system framework of embodied intelligence from the digital space to the physical world. This review aims to provide comprehensive background knowledge, research trends, and technical insights into embodied intelligence.



Overall structure of this review

2. Embodied Robots

Embodied agents actively interact with the physical environment and cover a wide range of embodied forms, including robots, smart home appliances, smart glasses, and self-driving vehicles. Among these, robots are one of the most prominent embodied forms and have attracted much attention. Depending on the application scenario, robots are designed in various forms to make full use of their hardware characteristics for specific tasks. As shown in the figure below, embodied robots can generally be divided into: (1) fixed-base robots, such as robotic arms, often used in laboratory automation, education, and industry; (2) wheeled robots, known for efficient mobility and widely used in logistics, warehousing, and security inspection; (3) tracked robots, with strong off-road capability and maneuverability, showing potential in agriculture, construction, and disaster response; (4) quadruped robots, known for stability and adaptability, well suited to exploring complex terrain, rescue missions, and military applications; (5) humanoid robots, with dexterous hands as their key feature, widely used in service industries, healthcare, and collaborative settings; and (6) biomimetic robots, which perform tasks in complex and dynamic environments by imitating the effective movements and functions of natural organisms.



Different forms of embodied robots

3. Embodied Simulation Platforms

Embodied simulation platforms are critical to embodied intelligence: they provide cost-effective experimentation, ensure safety by simulating potentially dangerous scenarios, scale to testing in diverse environments, enable rapid prototyping, make research accessible to a wider community, offer controllable environments for precise study, generate data for training and evaluation, and provide standardized benchmarks for algorithm comparison. For an agent to interact with its environment, a realistic simulated environment must be built, which requires modeling the physical characteristics of the environment, the properties of objects, and their interactions. As shown in the figures below, this review analyzes two types of simulation platforms: general-purpose platforms built on underlying physics simulation, and platforms based on real-world scenes.



General-purpose simulation platforms



Simulation platform based on real scenarios

4. Embodied Perception

The "North Star" of future visual perception is embodiment-centered visual reasoning and social intelligence. As shown in the figure below, unlike simply recognizing objects in an image, an agent with embodied perception must move through the physical world and interact with its environment, which requires a deeper understanding of three-dimensional space and dynamic scenes. Embodied perception demands visual reasoning as well as recognition: understanding the 3D relationships within a scene and predicting and performing complex tasks based on visual information. This review covers active visual perception, 3D visual grounding, vision-language navigation, and non-visual perception (tactile sensing).



Active visual perception framework

5. Embodied Interaction

Embodied interaction refers to scenarios in which an agent interacts with humans and the environment in physical or simulated space. Typical embodied interaction tasks include embodied question answering and embodied grasping. As shown in the figure below, in embodied question answering the agent must explore the environment from a first-person perspective to gather the information needed to answer a question. An agent with autonomous exploration and decision-making capabilities must consider not only which actions to take to explore the environment, but also when to stop exploring and answer.
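As a concrete illustration, this explore-then-answer loop can be sketched in a few lines of Python. Everything here, including the `observe` and `answer_confidence` callables and the confidence threshold, is a hypothetical stand-in for the agent's perception and answering modules, not part of any system described in the review:

```python
def embodied_qa(question, observe, answer_confidence, max_steps=20, threshold=0.9):
    """Explore until the agent is confident enough to answer, then stop.

    `observe` and `answer_confidence` are hypothetical callables standing
    in for an exploration policy and an answering module.
    """
    observations = []
    for step in range(max_steps):
        observations.append(observe(step))            # take one exploration action
        if answer_confidence(observations) >= threshold:
            break                                     # confident enough: stop exploring
    return observations

# Toy stand-ins: each new observation adds a fixed amount of evidence.
obs = embodied_qa(
    "Is there a mug in the kitchen?",
    observe=lambda step: f"view_{step}",
    answer_confidence=lambda views: 0.2 * len(views),
)
print(len(obs))  # 5 -- exploration stops once 0.2 * 5 >= 0.9
```

The interesting design question, as the review notes, is the stopping criterion: exploring too little risks wrong answers, exploring too long wastes actions.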



Embodied Question and Answer Framework

Beyond question-and-answer interaction with humans, embodied interaction also involves performing actions according to human instructions, such as grasping and placing objects, completing interactions among the agent, humans, and objects. As shown in the figure below, embodied grasping requires comprehensive semantic understanding, scene perception, decision making, and robust control planning. Embodied grasping methods combine traditional robotic kinematic grasping with large models (such as large language models and vision-language models), enabling agents to perform grasping tasks under multimodal perception, including active visual perception, language understanding, and reasoning.



A language-guided interactive grasping framework

6. Embodied Agents

An agent is defined as an autonomous entity that perceives its environment and acts to achieve specific goals. Recent advances in multimodal large models have further expanded the application of agents in real-world scenarios. When agents based on multimodal large models are embodied as physical entities, they can effectively transfer their capabilities from virtual space to the physical world, becoming embodied agents. To operate in the information-rich and complex real world, embodied agents have been developed with powerful multimodal perception, interaction, and planning capabilities. As shown in the figure below, to complete a task, an embodied agent typically involves the following two processes:

(1) Decompose abstract and complex tasks into specific subtasks, that is, high-level embodied task planning.

(2) These subtasks are then carried out step by step, either by effectively using embodied perception and embodied interaction models or by leveraging the policy functions of a foundation model; this is called low-level embodied action planning.

It is worth noting that task planning involves thinking before acting and is therefore usually considered in the digital space, whereas action planning must account for effective interaction with the environment and feed that information back to the task planner so the plan can be adjusted. It is therefore critical for embodied agents to align and generalize their capabilities from the digital space to the physical world.
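This two-level process can be sketched as a simple loop: a high-level planner decomposes the task, a low-level executor attempts each subtask, and failures are fed back so the plan can be revised. All names and the toy precondition logic below are illustrative assumptions, not the review's method:

```python
def high_level_plan(task):
    # Hypothetical LLM-style planner: map an abstract task to subtasks.
    plans = {"make tea": ["boil water", "get cup", "steep tea"]}
    return list(plans.get(task, [task]))

PRECONDITIONS = {"get cup": "open cabinet"}  # toy world knowledge

def low_level_execute(subtask, state):
    # Hypothetical skill executor: a subtask succeeds only if its
    # precondition (if any) is already satisfied; success updates state.
    need = PRECONDITIONS.get(subtask)
    if need is not None and need not in state:
        return False  # blocked -- feedback for the task planner
    state.add(subtask)
    return True

def run_agent(task):
    state, done, tried_recovery = set(), [], set()
    subtasks = high_level_plan(task)
    while subtasks:
        sub = subtasks.pop(0)
        if low_level_execute(sub, state):
            done.append(sub)
        elif sub not in tried_recovery:
            # Feed the failure back: insert the missing precondition
            # as a recovery subtask, then retry the original one.
            tried_recovery.add(sub)
            subtasks = [PRECONDITIONS[sub], sub] + subtasks
        else:
            done.append("failed: " + sub)
    return done

print(run_agent("make tea"))
# -> ['boil water', 'open cabinet', 'get cup', 'steep tea']
```

The point of the sketch is the feedback path: "get cup" fails at the action level, and the failure propagates back to the planning level, which inserts "open cabinet" before retrying.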



Embodied Agent Framework Based on Multimodal Large Model

7. Sim-to-Real Transfer

Sim-to-real adaptation in embodied intelligence refers to transferring capabilities or behaviors learned in simulated environments (the digital space) to the real world (the physical world). It involves validating and improving algorithms, models, and control strategies developed in simulation to ensure they perform stably and reliably in physical environments. Three key elements of sim-to-real adaptation are embodied world models, data collection and training methods, and embodied control algorithms. The figure below shows five different sim-to-real paradigms.
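One widely used technique in this space is domain randomization: physics parameters are resampled for each training episode so a policy never overfits to a single simulator configuration and is more likely to transfer to real hardware. The sketch below is a minimal illustration with assumed parameter names and ranges; it is not drawn from the review's five paradigms specifically:

```python
import random

def randomize_sim(base_params, rng):
    """Sample physics parameters around nominal values for one episode.

    The parameter names and ranges are illustrative assumptions.
    """
    return {
        "friction": base_params["friction"] * rng.uniform(0.5, 1.5),
        "mass":     base_params["mass"]     * rng.uniform(0.8, 1.2),
        "latency":  rng.uniform(0.0, 0.05),   # simulated sensor delay, seconds
    }

rng = random.Random(0)                         # fixed seed for reproducibility
nominal = {"friction": 1.0, "mass": 2.0}
episodes = [randomize_sim(nominal, rng) for _ in range(1000)]

frictions = [e["friction"] for e in episodes]
print(min(frictions) >= 0.5 and max(frictions) <= 1.5)  # True
```

A policy trained across such randomized episodes treats the real world as just one more sample from the training distribution, which is the intuition behind this family of sim-to-real methods.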



Five sim-to-real transfer paradigms

8. Challenges and future development directions

Although embodied intelligence is developing rapidly, it faces several challenges and presents exciting future directions:

(1) High-quality robot datasets. Obtaining sufficient real-world robot data remains a major challenge: collecting it is time-consuming and resource-intensive, while relying solely on simulated data exacerbates the sim-to-real gap. Creating diverse real-world robot datasets requires close, extensive collaboration across institutions, and developing more realistic and efficient simulators is critical to improving the quality of simulated data. Building a general embodied model applicable across robotic scenarios and tasks will require large-scale datasets in which high-quality simulated data supplements real-world data.

(2) Effective use of human demonstration data. This involves leveraging actions and behaviors demonstrated by humans to train and improve robotic systems, by collecting, processing, and learning from large-scale, high-quality datasets in which humans perform the tasks a robot needs to learn. It is therefore important to effectively combine large amounts of unstructured, multi-label, multimodal human demonstration data with action-labeled data when training embodied models, enabling them to learn a variety of tasks in a relatively short time. By exploiting human demonstrations efficiently, robotic systems can reach higher levels of performance and adaptability, better equipping them to perform complex tasks in dynamic environments.

(3) Cognition of complex environments. This refers to an embodied agent's ability to perceive, understand, and navigate complex real-world environments, whether physical or virtual. For unstructured open environments, current work typically relies on the task-decomposition mechanisms of pretrained LLMs, using broad commonsense knowledge for simple task planning while lacking scene-specific understanding. Enhancing knowledge transfer and generalization in complex environments is critical: a truly general-purpose robotic system should understand and execute natural-language instructions across varied and unseen scenarios, which requires developing adaptable and scalable embodied-agent architectures.

(4) Long-horizon task execution. A single instruction often entails a long-horizon task: a command like "clean the kitchen" involves rearranging items, sweeping the floor, and wiping the table. Completing such tasks requires the robot to plan and execute a long sequence of low-level actions. Although current high-level task planners have shown initial success, they often fall short in diverse scenarios for lack of adaptation to embodied tasks. Meeting this challenge requires efficient planners with strong perception capabilities and extensive commonsense knowledge.

(5) Causal discovery. Existing data-driven embodied agents make decisions based on correlations in the data. This modeling approach does not allow them to truly understand the causal relationships among knowledge, behavior, and environment, leading to biased strategies and making it difficult to operate in real-world environments in an interpretable, robust, and reliable manner. Embodied agents therefore need to be driven by world knowledge and possess autonomous causal reasoning capabilities.

(6) Continual learning. In robotics, continual learning is critical for deploying robot learning policies in diverse environments, yet the area remains underexplored. While recent studies have examined subtopics such as incremental learning, fast motion adaptation, and human-robot interactive learning, these solutions are typically designed for a single task or platform and do not yet consider foundation models. Open research questions and possible approaches include: 1) mixing appropriate proportions of prior data distributions into fine-tuning on the latest data to mitigate catastrophic forgetting; 2) developing effective prototypes for learning new tasks by inference from prior distributions or curricula; 3) improving the training stability and sample efficiency of online learning algorithms; and 4) finding principled ways to integrate high-capacity models into control frameworks, perhaps via hierarchical learning or slow-fast control, to achieve real-time inference.
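The first of these open questions, mixing prior-task data into fine-tuning batches to mitigate catastrophic forgetting, can be sketched concisely. The helper below is a hypothetical illustration; the function name and the 25% prior fraction are assumptions, not prescriptions from the review:

```python
import random

def mixed_batch(new_data, prior_data, batch_size, prior_fraction, rng):
    """Build a fine-tuning batch that replays a fixed fraction of
    prior-task samples alongside new-task data, so gradient updates
    on the new task do not erase previously learned behavior."""
    n_prior = int(batch_size * prior_fraction)
    batch = (rng.sample(prior_data, n_prior)
             + rng.sample(new_data, batch_size - n_prior))
    rng.shuffle(batch)  # avoid ordered prior/new blocks within the batch
    return batch

rng = random.Random(0)                           # fixed seed for reproducibility
prior = [("prior", i) for i in range(100)]       # stand-in for old-task samples
new = [("new", i) for i in range(100)]           # stand-in for new-task samples
batch = mixed_batch(new, prior, batch_size=32, prior_fraction=0.25, rng=rng)
print(sum(1 for tag, _ in batch if tag == "prior"))  # 8
```

Choosing `prior_fraction` is itself the open question the review raises: too low and the model forgets, too high and it underfits the new task.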

(7) Unified evaluation benchmarks. Although many benchmarks exist for evaluating low-level control policies, they often differ significantly in the skills they assess, and the objects and scenes they include are often constrained by the simulator. Comprehensively evaluating embodied models requires benchmarks that cover a broad range of skills in realistic simulators. For high-level task planning, many benchmarks assess planning through question-answering tasks, but a better approach is to evaluate the high-level planner and low-level control policies together, especially on long-horizon tasks with measured success rates, rather than evaluating the planner in isolation. Such an approach yields a more complete assessment of an embodied system's capabilities.

In short, embodied intelligence enables agents to perceive, recognize, and interact with a wide variety of objects in the digital space and the physical world, underscoring its significance for achieving general artificial intelligence. This review comprehensively covers embodied robots, embodied simulation platforms, embodied perception, embodied interaction, embodied agents, sim-to-real robot control, and future research directions, and should help advance the development of embodied intelligence.

About Pengcheng Laboratory Institute of Multi-Agent and Embodied Intelligence

The Institute of Multi-Agent and Embodied Intelligence, affiliated with Pengcheng Laboratory, brings together dozens of top young scientists in intelligent science and robotics. Relying on independent and controllable AI infrastructure such as Pengcheng Cloud Brain and the China Computing-Power Network, it is committed to building general-purpose foundational platforms, including multi-agent collaboration and simulation training platforms and cloud-based collaborative embodied multimodal large models, to support major application needs such as the industrial Internet and social governance and services.