
Tencent Chief Scientist Zhang Zhengyou: Simply stuffing a large model into a robot will not produce true embodied intelligence

2024-07-17


Zhang Zhengyou, Chief Scientist of Tencent and Director of Tencent Robotics X Lab

To explore the human-machine relationship in the AI era and prompt society to think about the economic opportunities and social response strategies of the era of human-machine symbiosis, Tencent Research Institute, Qianhai International Affairs Research Institute, Qingteng, Hong Kong Science and Technology Park Corporation, and other institutions jointly organized the "Prospects of the Human-Machine Relationship in the AI Era" forum, the second session of the "Artificial Intelligence + Social Development" series of high-end seminars.
At the forum, Zhang Zhengyou, Chief Scientist of Tencent and Director of Tencent Robotics X Lab, presented the lab's progress in developing intelligent robots based on "hierarchical" control in his keynote speech, "Some Challenges and Progress of Embodied Intelligence." The hierarchy comprises three levels of control: the body, the environment, and the task. Its advantage is that knowledge at each level can be continuously updated and accumulated, while the capabilities of the levels remain decoupled. This year, Tencent Robotics X Lab developed its own five-finger dexterous hand and robotic arm, and integrated a mobile chassis into the robot for the first time; together with large perception and planning models, the robot can communicate freely and complete tasks.

As for how intelligent robots will enter people's lives, Zhang Zhengyou said: "In the long run, robots will definitely enter households everywhere. At present, they may first bring major changes in fields such as rehabilitation, elderly care, and personalized education."

The following is the full text of Zhang Zhengyou's speech:


Good afternoon, distinguished leaders, guests, teachers, and students. Today I would like to share with you some of the challenges and progress in embodied intelligence.

The term "embodied intelligence" suddenly became popular last year, and everyone felt it was very cool. In fact, embodied intelligence is defined relative to non-embodied intelligence: ChatGPT, for example, has intelligence without a body. For me, an embodied intelligent agent is an intelligent robot. As for whether intelligence should have a body, those of us who build robots certainly hope it does, because having a body allows intelligence to develop better.

At the beginning of 2018, Ma Huateng, Chairman and CEO of Tencent, decided to establish Tencent Robotics X. At that time, I posted the following message on WeChat Moments (April 6, 2018): "A body without a soul is a zombie, and a soul without a body is a ghost. We don't want zombies, and we don't want ghosts wandering around; we create robots that work in harmony with humans and help each other!" In other words, we want to create intelligent robots that enhance human intelligence, unleash human physical potential, care for human emotions, and promote interaction between humans and robots, ushering in an era of coexistence, co-creation, and win-win between humans and robots. This was the original intention behind establishing Tencent Robotics X.

In fact, whether intelligence needs to be embodied is controversial, and the controversy mainly plays out in cognitive science. Some in this field believe that many cognitive traits require the organism as a whole to shape its intelligence, while others believe that intelligence does not require a body, because the main tasks we face, such as information processing, problem solving, and decision making, can be accomplished through software and algorithms. The term and concept of embodied intelligence have existed for a long time. For many people, the body is crucial to intelligence, because intelligence originates from the interaction between an organism and its environment, and that interaction is what drives the growth and development of intelligence.

Looking back at the paper Turing wrote in 1950 on how to achieve machine intelligence, some thought we could use very abstract activities, such as playing chess, to achieve intelligence, while others thought machines should be given organs, such as speakers, to help achieve machine intelligence faster. Turing himself, however, said he did not know which approach was best. OpenAI also bought hundreds of robotic arms early on, hoping to use robots to achieve AGI directly. After more than a year of effort, they found this path temporarily unworkable, gave up, focused their energy on large text models, and eventually developed ChatGPT.

Robots have a long history. Initially they were robotic arms automating production lines, completing a series of precisely controlled actions in a known environment. I call this zero intelligence, because the process requires no intelligence: although this type of robot has very strong operational capabilities, those capabilities are pre-programmed for a fixed environment.

Entering the era of large models, some people think that since large models are so powerful, intelligent robots can be realized immediately by putting a large model into a robot. In fact, this is not the case. What is the current situation? To make an analogy, it is like putting a 20-year-old brain on a 3-year-old body: the robot has some mobility, but its manipulation ability is very weak. True embodied intelligence must be able to learn and solve problems autonomously, and to adjust and plan automatically when the environment changes or is uncertain. We believe embodied intelligence is a very important step toward AGI, or toward creating a general intelligent robot.

Specifically, embodied intelligence is the ability of an intelligent agent with a physical carrier (an intelligent robot) to accumulate knowledge and skills through perception, control, and autonomous learning over a series of interactions, forming intelligence and influencing the physical world. This is different from ChatGPT. Embodied intelligence acquires knowledge through human-like perception (vision, hearing, language, touch), abstracts it into a semantic representation to understand the world, and takes actions to interact with the world. This involves the integration of multiple disciplines, including mechanical engineering and automation, embedded systems, control and optimization, cognitive science, and neuroscience; it is a capability that can only emerge once all of these fields have developed to a certain level.

Embodied intelligence faces many challenges.

First, complex perception capabilities, including vision, hearing, and touch. Current large models, including GPT-4o, cover vision and hearing but not touch, yet touch is very important for embodied intelligence. Robots need complex perception to perceive and understand the unpredictable, unstructured environments and objects around them.

The second is strong execution capability, including movement, grasping, and manipulation, in order to interact with the environment and objects.

The third is learning ability, the ability to learn and adapt from experience and data in order to better understand and respond to changes in the environment.

The fourth is adaptive ability, the ability to autonomously adjust one's own behavior and strategies in order to better cope with different environments and tasks.

The fifth is very important: embodied intelligence will not be achieved by simply stacking these abilities; it will only truly be achieved through their organic and efficient collaboration and integration.

Sixth, the data we need in this process is very scarce. OpenAI originally hoped to achieve AGI directly through robots but gave up due to the lack of data. The data problem still has to be solved, and data scarcity is a major challenge; moreover, when collecting data in real scenarios, user privacy must also be protected.

Seventh, because embodied intelligence has to operate in human living environments, it must ensure its own safety and that of its surroundings.

The eighth is the issue of social ethics. When robots interact with humans, they must follow moral and legal norms and protect human interests and dignity.

Achieving embodied intelligence requires a lot of work. At present, many people believe that large models can solve the problem of intelligent robots. I have drawn a picture here: it is equivalent to putting a large model into the robot's head, which seems to solve the problem. However, this achieves only partial intelligence. We expect intelligence and the body to be organically integrated, so that real intelligence can emerge from the interaction between the robot and its environment.

To achieve this vision, I think a change in the control paradigm is needed. If you read robotics textbooks, the traditional control paradigm is a closed loop of perception, then planning, then action, then perception again. This paradigm cannot achieve intelligence. In 2018, I proposed the "SLAP paradigm", where S stands for sensing (perception), L for learning, A for action, and P for planning. Perception and action need to be tightly coupled to respond to an ever-changing environment in real time; above them sits planning, which handles more complex tasks; and learning permeates every module, so the system can learn from experience and data and autonomously adjust its behavior and strategy. This SLAP paradigm is very similar to human intelligence.
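As a rough illustration only, a SLAP-style control loop might be organized as in the sketch below. The class and method names are hypothetical and simplified, not the lab's actual implementation; sensing and acting form the tight inner loop, planning sits above them, and learning draws on the accumulated experience.

```python
# Minimal sketch of a SLAP-style (Sense, Learn, Act, Plan) control loop.
# All names here are hypothetical illustrations, not Tencent Robotics X code.

class SLAPAgent:
    def __init__(self):
        self.experience = []  # buffer of (observation, action) pairs for learning

    def sense(self, world):
        """S: perceive the current state of the environment."""
        return world["state"]

    def plan(self, observation, goal):
        """P: deliberate planning for the task (here a trivial policy)."""
        return "forward" if observation < goal else "stop"

    def act(self, world, action):
        """A: apply the chosen action to the environment."""
        if action == "forward":
            world["state"] += 1
        return world

    def learn(self):
        """L: update behavior from experience (placeholder: trim the buffer)."""
        self.experience = self.experience[-1000:]

    def run(self, world, goal, steps=10):
        for _ in range(steps):
            obs = self.sense(world)
            action = self.plan(obs, goal)
            world = self.act(world, action)
            self.experience.append((obs, action))
            self.learn()
        return world

if __name__ == "__main__":
    print(SLAPAgent().run({"state": 0}, goal=5))  # toy run: state advances to the goal
```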

Nobel Prize winner Daniel Kahneman, in his book "Thinking, Fast and Slow", argues that the human brain has two systems. System 1 leans on intuition and solves problems quickly; System 2 is deeper, rational thinking. In fact, people spend about 95% of their time in System 1 and only call on System 2 for rare, complex tasks. Why is the human brain so efficient that it can think using only a few dozen watts, far less than a GPU consumes? Because humans solve 95% of problems in System 1, and only difficult tasks go to System 2.

In the SLAP paradigm I proposed, the tight coupling of perception and action at the bottom level provides reactive autonomy, which corresponds to System 1; conscious autonomy above it realizes the rational thinking of System 2.

Based on the SLAP paradigm, and drawing on how the human brain and cerebellum control the limbs, we developed a hierarchical embodied intelligence system with three levels. The bottom level is proprioception, the robot's perception of its own body, corresponding to the motor signals that control its movement.

The second level is exteroception, the perception of the environment. By perceiving the environment, the robot knows which capabilities to call upon to complete the task.

The top level is task-related and is called the strategic-level planner. Only by planning well based on the specific task, the environment, and the robot's capabilities can the task be solved well.
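Purely as an illustration of the layering described above, the three levels could be wired together roughly as follows. The layer names follow the talk; everything else (classes, methods, the toy plan) is a hypothetical sketch rather than the lab's actual interfaces.

```python
# Hypothetical sketch of the three-level hierarchy: a strategic (task) planner
# on top, an exteroception (environment) layer in the middle, and a
# proprioception (body/motor) layer at the bottom.

class ProprioceptionLevel:
    """Bottom level: turns a desired motion into motor commands."""
    def execute(self, motion: str) -> str:
        return f"motor commands for '{motion}'"


class ExteroceptionLevel:
    """Middle level: perceives the environment and picks a skill mode."""
    def __init__(self, low: ProprioceptionLevel):
        self.low = low

    def perform_skill(self, skill: str, environment: dict) -> str:
        # e.g. choose wheeled vs. legged locomotion based on terrain
        motion = "roll" if environment.get("terrain") == "flat" else "walk"
        return self.low.execute(f"{skill} by {motion}")


class StrategicLevelPlanner:
    """Top level: decomposes a task into skills for the middle level."""
    def __init__(self, mid: ExteroceptionLevel):
        self.mid = mid

    def solve(self, task: str, environment: dict) -> list:
        skills = ["navigate to target", "manipulate object"]  # toy plan
        return [self.mid.perform_skill(s, environment) for s in skills]


if __name__ == "__main__":
    planner = StrategicLevelPlanner(ExteroceptionLevel(ProprioceptionLevel()))
    for step in planner.solve("deliver garbage to trash can", {"terrain": "flat"}):
        print(step)
```

Because each level only talks to the one below it through a narrow interface, knowledge at one level can be updated without disturbing the others, which is the decoupling property emphasized in the talk.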

Let me give you some specific demonstrations. Motion control at the lowest level (the proprioception level) is also learned from data. Here, a real dog runs on a treadmill while data is collected. Through imitation learning and reinforcement learning, the robot learns to move like a real dog. We use a virtual-real integrated world, with digital twins unifying the virtual and the real. What you see is only the appearance of the dog's movement; how the robot actually moves, how much force is needed, and what signals to send to the joints and motors are all obtained through reinforcement learning.
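To make the idea concrete, a common way to combine imitation and reinforcement learning for legged locomotion is to reward the policy both for tracking the reference motion captured from the real dog and for achieving the task (for example, a target speed). The sketch below is a generic illustration of that kind of reward shaping with assumed weights and units, not the lab's training pipeline.

```python
import numpy as np

# Toy reward shaping for imitation + reinforcement learning of locomotion:
# the policy is rewarded for tracking reference joint angles captured from a
# real dog, plus a task term for moving at the desired speed.

def imitation_reward(joint_angles, reference_angles, sigma=0.5):
    """Higher when the robot's joints track the mocap reference."""
    err = np.sum((joint_angles - reference_angles) ** 2)
    return float(np.exp(-err / (2 * sigma ** 2)))

def task_reward(forward_velocity, target_velocity=1.0):
    """Higher when the robot moves at the desired speed."""
    return float(np.exp(-abs(forward_velocity - target_velocity)))

def total_reward(joint_angles, reference_angles, forward_velocity,
                 w_imitate=0.7, w_task=0.3):
    return (w_imitate * imitation_reward(joint_angles, reference_angles)
            + w_task * task_reward(forward_velocity))

if __name__ == "__main__":
    ref = np.array([0.1, -0.2, 0.3, 0.0])   # reference pose from the real dog
    sim = np.array([0.12, -0.18, 0.28, 0.05])  # simulated robot pose
    print(round(total_reward(sim, ref, forward_velocity=0.9), 3))
```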

In another video, with no special human control, the robot dog has simply learned how a real dog moves; once it has learned, it runs on its own and feels somewhat lifelike.

This is the most basic ability (motor ability). The next step is to perceive the environment and complete tasks within it. I just talked about moving on flat ground; the second step is to add environmental information. We let it learn to crawl forward, climb stairs naturally, jump over hurdles, and leap over obstacles.

By this point, the robot dog has learned in simulation how to jump and cross obstacles. This dog, named Max, was developed by us; what makes it different from ordinary robot dogs is the wheels on its knees. It can move faster on flat ground using the wheels and switch to its four legs on uneven ground, combining the two locomotion modes.

Once the robot can adapt to its environment, we can have it do various things. For example, we ask one dog to catch another; if it catches up, it wins. To increase the complexity, when a flag appears, the dog that was being chased can start chasing after touching the flag. This is also learned automatically through reinforcement learning. In the video, one dog chases the other (we limit the speed so the dogs run slower); after the flag is touched the roles swap, and the dog now being chased cuts a corner to deceive its pursuer.
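For the chase game, a reinforcement-learning setup of this kind typically gives the chaser a reward for closing the distance and the evader a reward for keeping it, with the flag triggering a role swap. The toy functions below illustrate that idea with made-up thresholds and bonuses; they are not the actual training code.

```python
import math

# Toy reward signals for a chase-and-evade game: the chaser is rewarded for
# closing the distance, the evader for growing it, and touching the flag
# swaps the two roles.

def chase_rewards(chaser_pos, evader_pos, prev_dist, caught_radius=0.5):
    d = math.dist(chaser_pos, evader_pos)
    chaser_r = prev_dist - d            # positive when closing in
    evader_r = d - prev_dist            # positive when escaping
    if d < caught_radius:
        chaser_r += 10.0                # bonus for catching
        evader_r -= 10.0
    return chaser_r, evader_r, d

def maybe_swap_roles(evader_pos, flag_pos, roles, flag_radius=0.5):
    """If the evader touches the flag, the roles are reversed."""
    if math.dist(evader_pos, flag_pos) < flag_radius:
        roles = (roles[1], roles[0])
    return roles

if __name__ == "__main__":
    r_c, r_e, d = chase_rewards((0, 0), (2, 1), prev_dist=2.5)
    print(round(r_c, 2), round(r_e, 2), round(d, 2))
    print(maybe_swap_roles((2, 1), (2.2, 1.1), ("dog_A", "dog_B")))
```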

The advantage of such hierarchical embodied intelligence is that the knowledge at each level can be continuously updated and accumulated, and the capabilities of the levels are decoupled, so updating one level does not affect the knowledge already accumulated at the others.

For example, the dog-chasing-dog behavior was trained with reinforcement learning only on flat ground, with no obstacles added. When obstacles are now added, it does not need to relearn anything, because the lower level already knows how to deal with obstacles. You can watch the video: we did not retrain at all, we simply added obstacles. If the dog encounters a bar it goes under it, and if it encounters an obstacle it jumps over it, all automatically.

This work was completed at the beginning of last year and will soon be published as a cover story in Nature Machine Intelligence, a top international academic journal, which shows that the community considers it leading work.

Let me tell you about our progress over the past year in large model integration, that is, incorporating a language model and a multimodal perception model into our hierarchical embodied intelligence system. For example, if a person asks the robot to fry an egg, the LLM-based planning model decomposes the task into steps: first take the egg out of the refrigerator, crack the egg into the pan, and then fry it. Through multimodal perception, the robot must first know that the egg is in the refrigerator; the mid-level skills below are then called upon: the robot goes to the refrigerator, opens the refrigerator door, takes out the egg, and carries it back to the stove. At the bottom is low-level control, which governs how the robot walks to the refrigerator, how it opens the door, and so on; once learned, these are executed automatically. Finally, control returns to the top-level strategic planner. Note that in this closed loop, the robot's actions act on a virtual-real integrated world in which the digital and physical worlds are tightly coupled: in the digital simulation space there are robots and scenes that look very realistic, so skills learned in virtual space can be applied directly in real space.
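The closed loop described above can be pictured as three nested calls: the strategic planner decomposes the task, each step maps to mid-level skills, and each skill resolves to low-level control. The sketch below uses a hard-coded stand-in for the LLM planner and invented skill names purely for illustration; no real LLM API is called.

```python
# Hypothetical sketch of hierarchical task decomposition: a strategic-level
# planner (an LLM in the talk) breaks a task into steps, each step calls
# mid-level skills, and each skill resolves to low-level control.

def plan_with_llm(task: str) -> list:
    """Stand-in for the LLM planner: returns a fixed decomposition."""
    if "fry an egg" in task:
        return ["take the egg out of the refrigerator",
                "crack the egg into the pan",
                "fry the egg"]
    return [task]

MID_LEVEL_SKILLS = {
    "take the egg out of the refrigerator":
        ["navigate to refrigerator", "open refrigerator door",
         "grasp egg", "carry egg to stove"],
    "crack the egg into the pan": ["position hand over pan", "crack egg"],
    "fry the egg": ["turn on stove", "wait", "turn off stove"],
}

def execute_low_level(skill: str) -> str:
    """Bottom layer: would issue motor commands; here it just reports."""
    return f"executed: {skill}"

def run_task(task: str):
    for step in plan_with_llm(task):                        # strategic level
        for skill in MID_LEVEL_SKILLS.get(step, [step]):    # mid level
            print(execute_low_level(skill))                 # low level

if __name__ == "__main__":
    run_task("fry an egg")
```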

Here is a video. We place an intelligent robot in an environment it has never seen before. Its first step is to look around and explore. In the video, the robot's task is to take the garbage to the trash can, so it must first find the trash can and then deposit the garbage. Similarly, if the trash can is moved somewhere else, the robot, not knowing the environment, discovers the trash can through exploration and then delivers the garbage to it.

In the next scenario, the task is to deliver a mouse to the person wearing a blue top and jeans. There are many other people around, so the robot must find that specific person. It explores and searches automatically; many of the people it encounters are not wearing a blue top and jeans, and only when it sees the right person does it hand over the mouse.

During exploration, the robot remembers its surroundings, so it does not have to explore again every time. In the next scene, the robot first delivers medicine to a colleague and then throws away the cold medicine's packaging. Because it already mapped the environment while exploring, it knows where the trash can is and goes straight to it. It can also use spatial relationships, such as where the stool is and where the whiteboard is: if you ask it to deliver something to a person between the whiteboard and the high stool, it automatically avoids obstacles along the way.
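One simple way to picture this kind of spatial memory is a semantic map that records where named landmarks were observed during exploration, so later tasks can navigate directly to them or reason about relations such as "between the whiteboard and the stool". The class below is a hypothetical toy, not the system's actual representation.

```python
# Illustrative semantic spatial memory: remember where named landmarks were
# seen during exploration and reuse those locations for later tasks.

class SemanticMap:
    def __init__(self):
        self.landmarks = {}   # name -> (x, y) where the landmark was observed

    def remember(self, name, position):
        self.landmarks[name] = position

    def locate(self, name):
        return self.landmarks.get(name)

    def between(self, a, b):
        """Midpoint between two remembered landmarks, e.g. whiteboard and stool."""
        pa, pb = self.landmarks.get(a), self.landmarks.get(b)
        if pa is None or pb is None:
            return None
        return ((pa[0] + pb[0]) / 2, (pa[1] + pb[1]) / 2)

if __name__ == "__main__":
    m = SemanticMap()
    m.remember("trash can", (4.0, 1.5))
    m.remember("whiteboard", (0.0, 3.0))
    m.remember("stool", (2.0, 3.0))
    print(m.locate("trash can"))
    print(m.between("whiteboard", "stool"))
```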

Last year we also built a cocktail-making robot. At that time it used a self-developed three-finger hand, and the chassis was fixed. Take a look.

This flashy cocktail-making routine was also produced by first recording a real person making cocktails, learning his trajectories, and then reproducing them on the robot. The fingers also carry tactile sensors. To insert the stirring stick into the hole, vision alone is not accurate enough, so the robot relies on tactile perception to tell whether the stick is seated; if not, it shifts sideways and tries again until the stick goes in.
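The insertion strategy described here, vision for a rough alignment followed by tactile feedback to confirm and correct, can be sketched as a simple search loop. The functions below simulate the sensor and the sideways nudge with assumed numbers; they are not real hardware APIs.

```python
import random

# Toy sketch of tactile-guided insertion: vision gives a rough target, and if
# the (simulated) tactile check reports the stick is not seated, the robot
# nudges sideways and retries.

def tactile_inserted(offset, tolerance=0.02):
    """Pretend tactile check: success when the lateral offset is small (meters)."""
    return abs(offset) < tolerance

def attempt_insert(initial_offset, step=0.01, max_tries=20):
    offset = initial_offset
    for attempt in range(1, max_tries + 1):
        if tactile_inserted(offset):
            return f"inserted after {attempt} attempt(s)"
        # nudge toward the hole; a real system would use the tactile signal's
        # direction, here we just move toward zero offset
        offset -= step if offset > 0 else -step
    return "failed to insert"

if __name__ == "__main__":
    # vision gets us within roughly 5 cm; tactile feedback closes the gap
    print(attempt_insert(initial_offset=random.uniform(-0.05, 0.05)))
```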

That was last year's work. This year's work includes a self-developed five-finger hand and a self-developed robotic arm (last year we did not have our own arm). We now also have a mobile chassis, plus a large perception model and a large planning model, which let the robot communicate freely and complete tasks. Let's look at the video.

The lower right corner shows what the mobile intelligent robot sees. A bottle of whiskey is found on the table, and the robot is asked to pour a glass of whiskey. This is the robot's field of view, and it can recognize various objects in real time.

That’s all for now. Thank you.