
From spatial intelligence to embodied intelligence: Sim2Real AI, Cross-Dimension's most efficient path to real-world implementation

2024-07-22


Published by the Synced Editorial Department

In the year and more since the embodied intelligence craze began, the way information is produced and the way we interact with the physical world have undergone revolutionary changes.

At the same time, a new battle is quietly beginning: major players are racking their brains to seize the most valuable AI "fuel": data. At present, data scarcity remains a high wall in front of general embodied intelligence. Looking back over the past three years of embodied intelligence research at well-known companies such as Google, NVIDIA, and OpenAI, no scaling law has yet emerged, and this is tied to the shortage of data of all kinds.

How can this fundamental pain point be solved? From a technical perspective, Sim2Real AI is a long-standing path. However, owing to preconceptions about how hard the Sim2Real gap is to eliminate, academia and industry have tended to treat it as merely an auxiliary means of supplementing data.

But is it really so?

Jia Kui, a tenured professor at the Chinese University of Hong Kong (Shenzhen) and founder of Cross-Dimension, gave his answer through long-term practice spanning academia and industry: "Sim2Real AI is the most efficient path to embodied intelligence."

From two-dimensional vision to three-dimensional vision, from spatial intelligence to embodied intelligence, from scientific research to products and then to commercial implementation, Jia Kui has been exploring this field for more than 20 years. Recently, at WAIC, he had a conversation about how embodied intelligence can break through the data dilemma.

If an AI were to read this conversation, it might summarize these key points for you:

What is the essence of spatial intelligence and embodied intelligence, the hottest topics of the moment?

What is the specific meaning of realizing spatial and embodied intelligence using the Scaling Law paradigm?

Which is the most efficient path to achieve general embodied intelligence?

How does embodied intelligence move from technology to product and then to commercial implementation?

Looking ahead, which visions that break through the industry's production paradigm will come true?

Of course, there are still parts that an AI cannot yet grasp: the firm confidence and sense of historical mission this researcher and entrepreneur displays.

The following is the interview transcript:

Building a “world model”

Triggering the robot's "spirituality"

Q: Professor Fei-Fei Li, known as the "godmother of AI," chose "spatial intelligence" as the direction of her first entrepreneurial venture, which has drawn widespread attention to the field. Can you talk about your understanding of spatial intelligence and embodied intelligence?

Jia Kui: Spatial intelligence and embodied intelligence are topics that have gained widespread attention in recent years, but the academic research behind them has been going on for a long time. Spatial intelligence is a multidimensional concept, usually referring to an individual's cognitive and reasoning abilities in three-dimensional physical space and four-dimensional space-time, including perception, reasoning, and decision-making. Embodied intelligence refers to intelligence realized by a system that has a physical form and interacts with the environment through that form; it concerns not only perception but also the agent's actions on and reactions to the environment. Just as humans perceive the world with their eyes, embodied intelligence requires robots to perceive, interact, and make decisions through multimodal sensors, forming comprehensive spatial cognition and manipulation capabilities.

Q: What are the similarities and differences between spatial intelligence and embodied intelligence?

Jia Kui: As mentioned earlier, spatial intelligence gives AI the ability to perceive and understand the real world. Embodied intelligence requires not only the perception and cognitive reasoning over objects, environments, and other agents that spatial intelligence involves, but also the high-level motion planning and low-level motion control required for robot operation, as well as the various robot "skills", analogous to human manipulation abilities, defined by the interaction between the robot body and the objects it manipulates. Mastering a skill means the robot can handle the whole class of objects related to that skill, not just one specific, concrete object.

These skills comprise collections of "sub-skills" and "atomic skills", forming a robot skill library, or "skill space". The essence of embodied intelligence is to learn and generalize over this skill space, thereby achieving, as humans have, general artificial intelligence (AGI) with embodied attributes.

In concrete applications, spatial intelligence has a wider scope and can exist either on or apart from a robot; it is essentially a question of understanding space, as in its important applications AR/VR. Embodied intelligence is mainly realized in robots, especially general-purpose (humanoid) robots.

In general, spatial intelligence focuses more on cognitive and reasoning abilities in four-dimensional space-time, while embodied intelligence further includes the ability to interact directly with the environment through physical form.

Q: Why did you choose to start a business in the field of spatial and embodied intelligence?

Jia Kui: We have followed this field for a long time and have deep historical and technical accumulation. The team established the "Geometry Perception and Intelligence Laboratory" early on, before the well-known "big companies" had entered the field. We are among the earliest scholars and teams in China to apply artificial intelligence to non-Euclidean data such as three-dimensional data.

Our team has conducted a lot of cross-disciplinary innovative research in the fields of geometric deep learning, 3D modeling, spatial perception, robotic applications, etc., and achieved a series of representative results, including Grasp Proposal Networks (NeurIPS 2020), Analytic Marching (ICML 2020/TPAMI 2021), Sparse Steerable Convolution (NeurIPS 2021), 3D AffordanceNet (CVPR 2021), Fantasia3D (ICCV 2023), SAM-6D (CVPR 2024), etc.



DexVerse™ 2.0 introduces the new 4D Mesh technology, designed for dynamic physics simulation and rendered data generation, which can uniformly handle rigid bodies, soft bodies, fluids, and other object types. As the engine's core representation, 4D Mesh runs through the entire process, from physical simulation and annotated data generation to large-model training.

Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650927069&idx=1&sn=32b8072ec663f02350d310f082511ebb&chksm=84e42ba3b393a2b5a5ca60fb8582ae4320820f4eb88e827a2f5830eedcc274e6a904482c6f59&token=263296417&lang=zh_CN#rd

Q: What is your understanding of the core concepts of spatial and embodied intelligence? On this hot track, what are Cross-Dimension's advantages?

Jia Kui: We believe the core of spatial and embodied intelligence lies in establishing a "world model" that gives robots a "spirituality" similar to human perception. Specifically, we must build a "world model" that can accurately model, understand, and reason about spatial geometry and physical processes, so that robot sensors of all kinds, including vision, force, and touch, can attain human-like perception.

Under the current AI architecture and model paradigm, our team hopes, through generative physical simulation, to capture a four-dimensional space-time mirror of the world humans live in, and thereby obtain endless data with physical properties. This is the key to realizing spatial and embodied intelligence.

Therefore, since its inception, Cross-Dimension has built its own underlying DexVerse™ spatial and embodied intelligence engine, which automates the full chain of physical simulation, data synthesis, and model training for specific business scenarios. On this basis it has formed a spatial and embodied intelligence large-model kit and purely vision-based intelligent sensors, giving general-purpose robots an intelligent brain and eyes.

Currently, using 100% synthetic data, Cross-Dimension has achieved task success rates of over 99.9% in multiple commercial scenarios with millimeter/sub-millimeter operating accuracy requirements.

General spatial and embodied intelligence

How far is the endgame?

Q: You just mentioned realizing spatial and embodied intelligence under the Scaling Law paradigm. Can you elaborate on what that means? Is achieving generality in spatial and embodied intelligence harder than achieving generality in large language models? Why?

Jia Kui: Achieving general spatial and embodied intelligence is indeed harder than achieving generality in large language models. Large language models, represented by OpenAI's GPT series, leverage massive natural language corpora combined with "self-supervised pre-training + supervised learning + reinforcement learning for intent alignment" to achieve zero-shot natural language understanding, that is, general capability, showing the dawn of so-called AGI.

Human natural language can be seen as a highly abstracted semantic code of the universe and natural environment we live in. Therefore, it is relatively easy for large language models to learn and generalize directly at the abstract level.

In comparison, spatial intelligence must learn from the raw signals produced by sensors, which means crossing the "semantic gap" from raw digital signals to human semantic symbols. Learning general intelligence through a GPT-like Scaling Law paradigm requires a large amount of training data, and the training data for spatial intelligence must not only be plentiful but also come from precisely calibrated sensors, so that the raw signals carry measurements on an absolute physical scale. This is far harder than scraping massive image and text data from the Internet.

Embodied intelligence goes a step further. Beyond learning general intelligence from high-dimensional sensory signals such as vision, force, and touch, its more essential goal is to learn the robot's "skill space", defined jointly by the robot body and the objects it operates on. The generality of embodied intelligence is reflected in generalization across this skill space, which adds the difficulty of learning under different paradigms.

Q: Can you talk about what specific multimodal large model capabilities are needed for spatial intelligence and embodied intelligence?

Jia Kui: Spatial intelligence involves tasks such as perception, interaction, reasoning, and decision-making in the three-dimensional physical world. Embodied intelligence further requires building a robot's library of autonomous operation skills on top of intelligent analysis of spatial perception signals such as vision, force, and touch.

Therefore, multimodal large-model capabilities are needed, spanning natural language, force and touch, vision, the robot's body state, and other modalities. These modalities can be "integrated" in a common semantic, spatiotemporal, and skill space, thereby achieving human-like spatial and embodied intelligence.

Q: In your opinion, how far is general spatial and embodied intelligence from the endgame?

Jia Kui: The current Scaling Law AI paradigm, characterized by massive data, large models, and huge computing power, presupposes that general robot hardware is mature, that is, that core components such as humanoid robots, dexterous hands, and humanoid sensors can be stably mass-produced at reasonable cost. At a minimum, spatial and embodied intelligence must close the commercial loop in multiple bounded business scenarios with reasonable ROI and form independent commercial value.

Specifically, robots can complete a variety of tasks in a generalizable manner in multiple scenarios such as industry, logistics, commerce, and home. Of course, this requires the acquisition of massive multimodal data with physical properties, as well as the automatic calculation of rich annotations that support multiple learning strategies such as supervised training, imitation learning, and reinforcement learning.

The most efficient path to general embodied intelligence

Q: I noticed that you mentioned in your WAIC speech that "Sim2Real AI is the most efficient path to achieve embodied intelligence." Can you elaborate on this?

Jia Kui: To achieve embodied intelligence, the nature and purpose of the data must be considered. The goal of embodied intelligence is to enable robots, based on sensor signals such as vision, force, and touch, to achieve general operational capabilities in the ever-changing physical world, just as we humans do in daily life.

Under the Scaling Law AI paradigm, machine learning models do not possess true general intelligence or generalization; they can only "interpolate" within the statistical distributions they have learned. Training embodied intelligent robots under this paradigm therefore requires a large amount of data.

These data should cover all the operating conditions involved in each robot skill: from morning to night, across spring, summer, autumn, and winter, indoors and outdoors. If we rely on robot data collection systems or wearable devices, such as the familiar "teleoperation", then collecting enough data would first require a business model in which users help collect data while enjoying a service of commercial value, and no such model exists at present.

In comparison, Sim2Real AI can cover all of the above variation far more efficiently through physical simulation and synthetic data. This approach allows various manipulated objects, environmental changes, robot configurations, and sensor variations to be simulated in a virtual environment, and the underlying physical simulation and data generation capabilities can be shared across different business scenarios. Any manipulated object, whether a rigid body, articulated object, soft body, or fluid, can support data generation through accurate physical simulation.
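To make this concrete, the variation described above is typically covered by domain randomization: each synthetic training sample is rendered under randomly sampled lighting, object, and sensor parameters. The minimal Python sketch below illustrates the idea only; the parameter names and ranges are hypothetical, not DexVerse™ internals.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneConfig:
    """One randomized simulation scene for synthetic data generation."""
    lighting_lux: float    # dim indoor evening up to bright outdoor noon
    object_scale: float    # size variation of the manipulated object
    friction: float        # surface physical property
    camera_jitter: float   # sensor pose noise, in degrees

def sample_scene(rng: random.Random) -> SceneConfig:
    """Sample one scene; sweeping these ranges stands in for the
    morning-to-night, indoor-and-outdoor variation described above."""
    return SceneConfig(
        lighting_lux=rng.uniform(50.0, 100_000.0),
        object_scale=rng.uniform(0.8, 1.2),
        friction=rng.uniform(0.2, 1.0),
        camera_jitter=rng.uniform(-2.0, 2.0),
    )

# Generate a batch of randomized scene configurations to render training data from.
rng = random.Random(42)
scenes = [sample_scene(rng) for _ in range(1000)]
```

Sweeping such ranges widely enough in simulation is what allows a model trained on 100% synthetic data to transfer to real sensors without ever seeing a real image.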

So, in summary: although teleoperation via robot data collection systems or wearable devices can quickly demonstrate some human-like manipulation, this approach falls far short of what general-purpose embodied intelligence requires. Sim2Real AI is the most efficient path to that goal.

Q: Under this technical path, how can the gap between synthetic data and real data be eliminated?

Jia Kui: From the academic perspective, Sim2Real AI is a long-standing technical path and one of the mainstream routes to spatial and embodied intelligence. Our team also started from academia and blazed a unique trail in the course of product and business implementation: with 100% synthetic data, we achieve task success rates of over 99.9% in multiple scenarios with millimeter/sub-millimeter accuracy requirements, which may be unique worldwide.

No success is accidental; it rests on a deep understanding of the problem and a systematic solution. Starting from first principles and thinking about the essence of the problem, the Cross-Dimension team found an effective solution by simplifying complex problems and breaking them down layer by layer.

Simply put, to implement embodied intelligence in the way of Sim2Real AI, we need to:

1) Robot body simulation, multimodal sensor simulation, simulation of operation objects of different forms, and dynamic process simulation;

2) Generate data and annotation rendering corresponding to the simulation;

3) Establish an automated chain for Sim2Real transfer, including the design and training of embodied intelligence large models, and overcome at least the following core technical barriers:

Low-level controllable embodied physical simulation

Efficient multimodal large model training and continuous learning

Effective handling of the domain gap between synthetic and real data

Low-cost acquisition of massive digital assets

Q: Based on the Sim2Real AI technology path you just described, what practical results has Cross-Dimension achieved?

Jia Kui: Cross-Dimension has built the embodied intelligence engine DexVerse™ from the ground up, including modules for physical simulation, rendered data generation, automatic annotation computation, and model design and training. Without any involvement from R&D personnel, the engine can automatically generate AI model SDKs for embodied intelligence tasks across the entire chain. Data generation runs at the same pace as AI model training iterations, so there is no need to store data at all, and the amount of accumulated training data will no longer be a yardstick for embodied intelligence deployment. Currently, the implementation of Cross-Dimension's software and hardware products in multiple scenarios is supported by DexVerse™.



As shown in the figure above, DexVerse™ 2.0 goes a step further:

First, given a business scenario with clear boundaries and a robot hardware configuration, DexVerse™ 2.0 can use a large language model to automatically decompose the robot skills and sub-skills involved.

Second, for any skill or sub-skill, DexVerse™ 2.0 can automatically generate the digital assets, such as objects and scenes, required for simulation, and, by simulating and rendering these assets, generate robot operation trajectory data in virtual space.

Next, an embodied 3D VLA (Vision-Language-Action) model is trained on the data generated in virtual space.

Finally, the trained model can drive the robot body within the chosen business scenario and perform the various robot skills in a generalizable manner.
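The four stages above can be sketched as a simple pipeline. The Python below is a stub-level illustration only: every function name and return value is hypothetical, standing in for stages the interview describes rather than for DexVerse™'s actual API.

```python
from typing import List

def decompose_with_llm(scenario: str) -> List[str]:
    """Stage 1: an LLM decomposes the scenario into skills/sub-skills
    (stubbed here with a fixed lookup table)."""
    return {"block_assembly": ["locate", "grasp", "place"]}.get(scenario, [])

def generate_assets(skill: str) -> dict:
    """Stage 2a: generate the digital assets (objects, scenes) a skill needs."""
    return {"skill": skill, "objects": ["block"], "scene": "tabletop"}

def simulate_and_render(assets: dict, n_episodes: int = 100) -> list:
    """Stage 2b: roll the skill out in simulation, rendering labeled episodes."""
    return [{"assets": assets, "episode": i} for i in range(n_episodes)]

def train_vla(episodes: list) -> dict:
    """Stage 3: train a (placeholder) 3D VLA policy purely on synthetic episodes."""
    return {"policy_for": episodes[0]["assets"]["skill"], "n_train": len(episodes)}

def build_skill_library(scenario: str) -> list:
    """Stage 4: one trained policy per sub-skill, ready to drive the real robot."""
    return [train_vla(simulate_and_render(generate_assets(s)))
            for s in decompose_with_llm(scenario)]

policies = build_skill_library("block_assembly")
```

The point of the structure is that only stage 1 touches the business scenario; stages 2-4 are scenario-independent machinery, which is what lets one engine serve many verticals.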



Through the DexVerse™ Embodied Intelligence Engine 2.0, task decomposition, scene generation, training configuration generation, and model training are performed automatically across the entire chain. The trained model is then deployed on the real machine to guide the robot through the assembly of a deer from building blocks.

Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650927069&idx=1&sn=32b8072ec663f02350d310f082511ebb&chksm=84e42ba3b393a2b5a5ca60fb8582ae4320820f4eb88e827a2f5830eedcc274e6a904482c6f59&token=263296417&lang=zh_CN#rd

Through this fully automated engine, the flywheel by which general-purpose robots acquire embodied skills and sub-skills will turn at maximum efficiency, driving the deployment of general-purpose robots in more scenarios. Cross-Dimension will cooperate with more industry partners, open up the ecosystem, and jointly promote the rapid development of China's embodied intelligence and general robotics industries.

Q: Why did Cross-Dimension choose to develop its own engine? What is the difference between DexVerse™ and NVIDIA's Omniverse™?

Jia Kui: The concept behind Cross-Dimension's embodied intelligence engine is completely different from that of engines such as NVIDIA's Omniverse™.

If Omniverse™ is a horizontal expansion, covering sectors such as robotics, scientific computing, and AI for Science while serving NVIDIA's AI computing products, then Cross-Dimension's DexVerse™ is an end-to-end vertical penetration: the engine's iterative evolution serves the realization of embodied skill tasks in vertical scenarios.

At present, Sim2Real AI is still at the stage where innovation drives product and business implementation. Only a self-developed engine can support every link of the R&D process, from physical simulation and rendered data generation to automatic annotation computation and embodied model design and training. Only by tackling each link one by one and mastering the know-how can products truly land in business scenarios.

The L1-L5 path to commercializing embodied intelligence

Q: What do you think is the path for embodied intelligence to move from technology to product and then to commercial implementation?

Jia Kui: The essence of embodied intelligence is to give various robots general operating capabilities in different application scenarios by learning a robot skill library comprising various generalizable skills. Its commercialization must therefore target bounded business scenarios such as industry, agriculture, commerce, and personal/family use, and "start from the end": form product value and commercial implementation by establishing general robot skills in independent business scenarios.

Technically, embodied intelligence must use the Sim2Real AI approach to open up an automated chain of task understanding, digital asset generation, data simulation generation, and AI model training, to achieve general robot task learning in the most efficient way, and in the process form software and hardware products suitable for different business scenarios, including embodied intelligence SoCs, smart sensors, general robot controllers, etc.

In terms of the path, embodied intelligence needs to first empower relatively mature hardware entities such as robotic arms and composite robots, and with the mature mass production of general entities such as dexterous hands and humanoid robots, further enhance the overall capabilities and generate greater commercial value.



Q: Based on the five stages of highly general embodied intelligence, L1-L5, that you proposed, which stage has Cross-Dimension reached now?

Jia Kui: Based on its self-developed DexVerse™ embodied intelligence engine, Cross-Dimension has established full-chain capabilities covering scenario task understanding, digital asset generation, simulated data generation, and AI model training to serve application scenarios such as smart manufacturing and smart agriculture, and has formed embodied intelligence products including smart vision sensors, PickWiz software, and composite robots.

At present, Cross-Dimension has successfully implemented the "simulation to reality" business model in more than 30 industries, including automotive parts, 3C manufacturing, new energy, home appliances, chemicals, and logistics, and has cooperated with many leading customers including GAC, Midea, Haier, Panasonic, and Lens Technology.

Referring to the L1-L5 figure above, Cross-Dimension has completed the L1 stage of embodied intelligence and is moving steadily toward L2, something few teams worldwide have achieved.

Q: What do you think the final ecological chain of embodied intelligence and humanoid robots will look like? Will Cross-Dimension build complete (humanoid) robot hardware?

Jia Kui: The final ecological chain for general robots will consist of humanoid body manufacturers, parts manufacturers, sensor manufacturers (vision, touch, and so on), and providers of embodied intelligence chips and solutions. As the industry chain moves toward this final state, Cross-Dimension's DexVerse™ embodied intelligence engine will play a decisive role in the technical path, product form, and commercial landing of scenarios. Through DexVerse™'s full-chain Sim2Real AI capabilities, and starting from the end, it will promote unified standards for embodied robots in hardware configuration, sensor selection, data modality paradigms, and multimodal large models, in a commercially closed-loop manner.

Cross-Dimension has developed embodied intelligence products such as composite robots, intelligent vision sensors, and PickWiz software. In the process of reaching more commercial scenarios, Cross-Dimension will first empower relatively mature embodied forms, mobile or wheel-legged chassis plus dual robotic arms, and ultimately join forces with humanoid robot body manufacturers to achieve the widespread implementation of general embodied intelligence.