
A post-90s Peking University doctoral supervisor builds humanoid robots, but not the Tesla way

2024-08-17


The United States is not a template for China.

Text丨Wang Yutong
Edited by Cheng Manqi

In May this year, a group of new workers, each 1.72 meters tall, started at Tesla's Texas factory. Their job was to move cylindrical 4680 battery cells from a transfer table into the red bins in front of them. They were not very skilled; they were even slow and clumsy. But because these workers were Optimus, the humanoid robot Tesla unveiled in 2022, everything was different.

"Perfect use case", "rapid progress", "unemployment warning": the comments under the working-robot video Tesla released mixed amazement with concern.

Wang He doesn't think so. He believes that Optimus is "still a research project" at this stage.

Wang He, born in 1992, is an assistant professor and doctoral supervisor at Peking University's Center on Frontiers of Computing Studies. He earned his bachelor's degree from the Department of Electronic Engineering at Tsinghua University and his doctorate from Stanford University, and has published dozens of papers at top conferences in computer vision, robotics, and artificial intelligence, such as CVPR and ICCV.

Since May last year, Wang He has drawn wider attention as the head of the intelligent-robot company Galaxy General. In June this year, Galaxy General set the 2024 record for angel-round financing with a 700 million yuan raise.

Most companies build complete humanoid robots. Wang He believes two legs are not the optimal solution at this stage; they only add cost. "It's not that humanoid robots' manipulation ability is already strong enough and all they lack is legs. There are still plenty of jobs that traditional robotic arms can't handle." In his view, two hands are what create value in real deployments, and a large number of scenarios do not actually require bipedal locomotion: inspection and patrol, for example, can be done by robot dogs just as well as by wheeled vehicles.

Galaxy General's Galbot picking up trash. It has no legs, but a foldable single limb on a wheeled chassis.

Obtaining enough data is one of the hard problems in embodied intelligence. Tesla and Google both collect data through "teleoperation": real people wear capture equipment and perform the motions the robot needs to learn. Wang He thinks this is not cost-effective: "Google spent more than ten months and tens of millions of dollars to collect hundreds of thousands of data points." Galaxy General chose to go all in on Sim2Real (transfer from simulation to the real machine), relying mainly on synthetic simulation data.

American humanoid-robot companies have deep pockets and are bold. Wang He observes that this lets them skip a strict search for PMF (product-market fit): "In the United States, since the money is there, they do everything at once." But industry problems such as the lack of real data and unstable hardware can only be solved by focusing on concrete scenarios, so he believes commercialization should be considered from day one.

"We should not take what Tesla has done as the standard," Wang He said. Chinese startups without America's abundant capital "will only end up dead if they keep telling other people's stories."

Wang He does not agree with Tesla's approach, and many in the industry do not agree with Galaxy General's. Take the Sim2Real route Wang He favors: many practitioners believe simulated synthetic data differs inherently from the real world, which hurts training. After Galaxy General launched its first robot, Galbot, some competitors said they felt "relieved": "The gap between the demo and actual deployment is huge." "He wrote plenty of grasping papers, and in the end the hand is a suction cup."

The bigger question is whether now is the right time to start a humanoid-robot company at all. Some investors believe this batch of companies will become martyrs because many technologies, such as hardware, materials, and energy, are not yet mature. Kai-Fu Lee said of embodied intelligence, "We definitely cannot invest in something that will happen 10 years from now." ZhenFund partner Dai Yusen said embodiment is still in its BlackBerry era, and one cannot yet invest in the iPhone.

Humanoid robots and embodied intelligence are still at a very early stage, and this is an industry with a long chain and a complex technology stack: AI, materials, energy, and mechanical control; development, manufacturing, supply-chain management, and customer development. The companies that ultimately survive cannot afford any weak link.

It is too early to pick winners, but this interview records what a young scientist has seen one year after setting out. He now believes that although large companies have more resources, they are not necessarily right, and that is his opportunity.

The greatest common divisor of embodied intelligence and human society

LatePost: You started researching embodied intelligence in 2016, combining visual models, natural language models, and robotic operation models. What have you seen from these years of research and development?

Wang He: I was already working on embodied intelligence as a doctoral student, though it wasn't called "embodied intelligence" then. Initially I combined three separate small models to achieve category-level object pose estimation (pose: an object's position and orientation in three-dimensional space; pose estimation: recovering that pose), which is in effect the basis of general two-handed manipulation.

After returning to China and before founding the company, I mounted an arm on the back of a Unitree robot dog and tried to make it perform a series of manipulations. But I found that the compute, the resources, and the whole system fell short of our needs in many respects.

At that time I felt that if I didn't build hardware, I would depend entirely on others, and system development and iteration would be constrained. It is hard to do only the intelligence when the robot industry barely exists.

LatePost: What changes have taken place since then? Why did you decide to start a business in 2023?

Wang He: Embodied-intelligence entrepreneurship fermented in China earlier than in the United States. The main reason is the maturity of hardware and robot bodies.

US manufacturing does not allow a complete embodied-intelligence demo to be built quickly: the parts supply is incomplete, many things must be imported, and hardware engineers are scarce. China, by contrast, can build hardware at the lowest cost and highest reliability. Unitree, for example, built a humanoid robot with just a few people in half a year.

But the body alone is just a big toy; the next step is to compete on intelligence. In 2023, PaLM-E and other embodied multimodal large models appeared around the world, and the spark between multimodal perception and embodied manipulation was lit. That is when I decided to start the company.

LatePost: Why did you choose to start a business making humanoid robots? The carrier of embodied intelligence does not necessarily have to be humanoid.

Wang He: There are indeed various forms, such as dogs, airplanes, and cars. But among all forms, the greatest common denominator between embodied intelligence and human society can only be "human form."

Because the entire environment of production and daily life is designed for humans, humanoids can perform the most operations; in the future their numbers will be the largest and their economic output the greatest. Looking at the long-term vision, embodied intelligence and humanoid robots can be equated.

LatePost: Many people believe that the window for embodied intelligence startups has not yet arrived, and that the current batch of companies will become martyrs, as many technologies such as hardware, materials, and energy are not yet mature. For example, Kai-Fu Lee talked about embodied intelligence, saying "We definitely cannot invest in something that will happen 10 years later"; ZhenFund partner Dai Yusen said that embodiment is still in the BlackBerry era, and that it cannot invest in the iPhone.

Wang He: When I met Professor Kai-Fu Lee in 2019, he said it would take another 50 years. Now he has accelerated from 50 years to 10 years.

Nor can we compare embodied intelligence to mobile phones. From feature phones to smartphones the technology changed enormously, whereas the technical direction of embodied intelligence is already clear: fuse the body with a large model to make a general-purpose robot.

The earlier you enter the game, the more technology and data you accumulate, and the gap widens later. Once robots enter real scenarios, real-world data feeds back into the intelligence. It is extremely hard for a latecomer to overtake a company that already has tens of thousands of robots deployed, a constant stream of real data flowing back, and the scenario pitfalls already behind it.

This is similar to autonomous driving: only when enough cars are sold is there enough data, and with data the algorithm improves faster. In the contest between Google and Tesla, Tesla won because it had enough cars.

Embodied intelligence has the potential to grow into a market comparable to automobiles. It shares the shape of previous technological shifts: slow at first, gradually replacing specialized robots; but once it reaches a scale of tens of thousands of units, the replacement of traditional industries accelerates.

LatePost: The fact is that the embodied-intelligence startup boom came after ChatGPT. Yet large models only solve a small part of the embodied-intelligence problem, so some think it is too early.

Wang He: Embodied intelligence is the product of integrating software, hardware, and algorithms. At this stage it combines with large models in two ways. The first is universal perception and language communication, which solves the interaction problem: if someone walks into a pharmacy and asks the robot what to take for an ailment, only a robot that knows the names and locations of the medicines can talk with them.

The second is that for specific operations such as grasping and placing objects, robots have achieved end-to-end control based on large models (perception goes in, a robot trajectory comes out directly). In the future, large models will also take on overall global planning.

Overall, large models are now auxiliary, but the combination of large and small models may lead to general-purpose robots.

LatePost: Galaxy General's route is a small 3D vision model plus a foundation large model. How should we understand that?

Wang He: Just as humans have System 1 and System 2, there is fast thinking and slow thinking. The former is the cerebellum's job; in robots that means skills such as interactive control and dexterous manipulation, which small models can handle. The latter is the brain's job: cognition, understanding, and planning, which a large model can solve.

This is a three-layer system: the bottom layer is hardware, the middle layer is small models that execute individual skills, and the top layer is a foundation large model responsible for task planning. When the robot receives a command, the large model calls the small models in the middle layer; after a small model finishes, the large model decides the next step based on the result.
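The dispatch loop Wang He describes can be sketched in a few lines. This is a toy illustration of the three-layer idea only; all names (`plan_task`, `SKILLS`, the canned plan) are assumptions for the sketch, not Galaxy General's actual API, and the "large model" is replaced by a fixed plan.

```python
# Toy sketch of the three-layer system: a planner ("large model") on top,
# small skill models in the middle, hardware below (omitted here).
from typing import Callable

# Middle layer: small models, one per skill (the "cerebellum").
def grasp(target: str) -> str:
    return f"grasped {target}"          # stand-in for a learned grasping policy

def place(target: str) -> str:
    return f"placed {target}"           # stand-in for a learned placement policy

SKILLS: dict[str, Callable[[str], str]] = {"grasp": grasp, "place": place}

# Top layer: in reality an LLM call; here a fixed plan for illustration.
def plan_task(command: str, history: list[str]) -> list[tuple[str, str]]:
    item = command.split()[-1]
    return [("grasp", item), ("place", item)]

def run(command: str) -> list[str]:
    history: list[str] = []
    for skill, arg in plan_task(command, history):
        result = SKILLS[skill](arg)     # large model dispatches a small model
        history.append(result)          # result informs the next planning step
    return history

print(run("fetch the aspirin"))         # ['grasped aspirin', 'placed aspirin']
```

The key design point is the feedback loop: the planner sees each skill's result before choosing the next call.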

Legs are not that important; hands are the key

LatePost: These are all humanoid robots, and most companies in the industry build them with two legs. Your first robot, Galbot, has a wheeled chassis and two hands.

Wang He: The most essential question is: what value does your product bring to the scenario? Bipedal robots only solve locomotion and have no manipulation ability, so they can only be used for patrol and inspection, which is no qualitative change from using cars and dogs.

Hands, by contrast, can do flexible production work that traditional robots cannot. Most of it is rough work in labor-intensive industries and is easier to generalize. The room for imagination and for scenarios is large, so the upper body matters more than the lower body.

LatePost: Which is harder, hand manipulation or complex legged locomotion? The final form everyone envisions is a full humanoid. Will a company that develops hands first fall behind when it later wants to add locomotion?

Wang He: Most operations today are done with hands, so we enter scenarios with "hands" first and use a replaceable, low-cost universal wheeled chassis in place of legs. We commercialize first and collect real data.

The problem with legs is that they cannot yet be deployed in real scenarios, so companies building bipeds must compete on their ability to keep raising money, and the next three years will bring a major shakeout. Of course, as leg performance improves and the price becomes right, we will switch to legs too.

LatePost: Why not do both at once?

Wang He: Because it is not the case that humanoid robots' manipulation is already strong enough and all they lack is legs. There are still plenty of jobs that traditional robotic arms cannot handle.

From a practical standpoint, wheeled robots beat bipeds on cost and stability by a wide margin. At the same height, the BOM (bill of materials) for two legs costs ten times more than a wheeled chassis. Bipeds also fall easily, and a fall can wreck the whole robot.

The technical difficulties of legs remain unsolved, and the scenarios legs can serve lag far behind those of hands. For example, if something falls from a shelf to the floor, no legged humanoid robot in the world today can bend down and pick it up.

LatePost: Squatting is easy for humans. Why can't robots do it?

Wang He: The hardest part is keeping the body balanced throughout. Legged balance comes in stages: the first step is walking; the second is climbing stairs, which has already stumped a number of companies; the third is bending over, which is hard because the center of gravity shifts outside the support area. Then come squatting and deep split-leg squats, which even laboratories cannot do yet.

Leg development lags hand development. The same is true for humans: a baby that can only crawl already explores everywhere with its hands, but it takes much longer to stand up and walk steadily. Many children still fall at six or seven.

In fact, bipedal walking was demonstrated 20 years ago, yet even today only a handful of robots can walk on flat ground for 10 minutes without trouble. The stability of many bipeds simply does not meet expectations. In embodied intelligence, the brain is ahead of the arms and hands, and the arms and hands are ahead of the legs.

LatePost: Tesla's humanoid robot Optimus has both hands and legs, and it can now work in factories.

Wang He: Optimus's current work scenarios have nothing to do with legs. Grabbing batteries in a factory and patrolling a parking lot place no demands on the legs.

And the economics are hard to justify: the robot costs tens of thousands or even hundreds of thousands of dollars, yet its job is to place identical batteries into a five-by-six box with thirty compartments, that is, standard batteries into standard baskets in fixed positions. Why does such a task need embodied intelligence? Why not use traditional industrial automation?

LatePost: Galaxy General's Galbot is sorting medicines in a Meituan pharmacy. A robotic arm could also do this, yet you used a humanoid upper body.

Wang He: We built this scenario to demonstrate embodied capability. Technology has not reached the stage where it can handle the hardest tasks, so we first look for what can be done. Tesla's scenario was originally handled by robotic arms; it is not even replacing people. The pharmacy work was previously done by humans, so it is harder than Tesla's. And it cannot be solved by industrial automation alone, because different medicines are not standard products and different orders are not standard demands.

Don't take Tesla as the benchmark; teleoperation cannot solve the data problem

LatePost: Data scarcity is a core difficulty of embodied intelligence: there are roughly 15T of text data, 6B images, and 2.6B videos, but only about 2.4M robot data points. Tesla and Google both collect data through teleoperation, that is, real people wearing capture equipment perform the actions the robot needs to learn, while Galaxy General is all in on Sim2Real, that is, simulated synthetic data. Why are you different from them?

Wang He: Teleoperation is not something startups can afford. It means hiring many people to repeat all kinds of operations, and obtaining one valid data point ties up a robot and a person for 30 seconds to a minute.

This is where humanoid robots differ sharply from autonomous driving. Tesla's autonomy can let a million owners pay for the cars themselves and drive them for hundreds of millions of hours in total, with no extra money spent on data collection. And driving is a single task, whereas a factory has many kinds of work: applying glue, placing batteries, tightening screws... and the correlation between different tasks can be strong or weak.

Tesla hired dozens of people to teleoperate battery placement, but there are many more operations, such as winding and assembly. Tesla has plenty of money and its own factories to absorb its own robots, so it can do this; startups cannot.

Just as autonomous driving now has remote monitors, teleoperation can serve as a remote-takeover mechanism: if a robot runs into trouble in a scenario and no one is on site, teleoperation can step in.

LatePost: So teleoperation is a game for big companies?

Wang He: That is the story Musk is telling. We should not take what Tesla has done as the standard. To be honest, it is still just research.

When Google was working on RT (Robotics Transformer, a robot control model), it had an "Everyday Robots" team of more than 200 people. After RT-1 was finished, that department was disbanded, because the business model did not exist.

Today, only Chinese embodied-intelligence companies without a path of their own will copy Tesla and Google. Without capital as abundant as American companies', following other people's stories leads only to a dead end.

LatePost: Doesn't this also depend on how much data a general-purpose robot needs? If it is within an order of magnitude, might a particularly wealthy big company, or a startup that can raise enough, still run the teleoperation route?

Wang He: Our own experiments show, taking the grasping task as an example, that with one billion grasp data points the robot's success rate reaches 87%. Cut the data to one ten-thousandth of that, that is, 100,000 grasps, and the success rate is only 58%. This shows embodied intelligence also has clear scaling laws, and its hunger for data is even greater.
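The two data points Wang He cites can be used for a back-of-the-envelope scaling check. The sketch below assumes, purely for illustration, that failure rate follows a power law in data volume (err = a·N^(-b)); this functional form is my assumption, not a claim about Galaxy General's actual curve.

```python
# Fit a power law err = a * N^(-b) to the two quoted points:
#   1e5 grasps -> 58% success (42% failure)
#   1e9 grasps -> 87% success (13% failure)
import math

n1, err1 = 1e5, 1 - 0.58
n2, err2 = 1e9, 1 - 0.87

b = math.log(err1 / err2) / math.log(n2 / n1)   # power-law exponent
a = err1 * n1 ** b                              # scale factor

def success(n: float) -> float:
    """Predicted grasp success rate at n data points, under the assumed law."""
    return 1 - a * n ** (-b)

print(f"exponent b = {b:.3f}")                  # ~0.127: error falls slowly
print(f"predicted success at 1e7 grasps: {success(1e7):.0%}")
```

The small exponent is the point: under this assumption, each 10x of data shaves only about a quarter off the error rate, which is why Wang He calls the data hunger "greater" than for language models.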

In the real world it is hard to obtain billions of data points. Google spent more than ten months and tens of millions of dollars to collect just hundreds of thousands.

LatePost: How much can simulation reduce costs?

Wang He: Through simulation synthesis, we can render 60 images in one second. Compared with collecting data from the real world, synthetic data is almost free. Our second curve is acquiring data from the real world.

In the simulator, we synthesize each object's motion into 200 videos, then expand the simulation from a single object to a whole class of objects. This generates a large amount of data, which we use to train the robot's grasping ability.
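The expansion from one object to a class is essentially domain randomization. The sketch below is a minimal, self-contained illustration of that idea; the field names, ranges, and the toy success label are assumptions for the sketch, not Galaxy General's actual pipeline (which would run inside a physics simulator).

```python
# Minimal sketch of Sim2Real-style data synthesis: randomize one object's
# scale and pose many times so a single object becomes a class of samples.
import random

def synthesize(object_name: str, n_samples: int, seed: int = 0) -> list[dict]:
    """Generate n_samples randomized variants of one object."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        scale = rng.uniform(0.8, 1.2)            # randomized object size
        samples.append({
            "object": object_name,
            "scale": scale,
            "pose": (rng.uniform(-0.3, 0.3),     # x on the table (m)
                     rng.uniform(-0.3, 0.3),     # y on the table (m)
                     rng.uniform(0.0, 360.0)),   # yaw (degrees)
            # Toy label standing in for a simulated grasp outcome.
            "grasp_ok": scale <= 1.1,
        })
    return samples

# Expand from single objects to a class of objects, 200 variants each.
dataset = [s for name in ("bottle", "can", "cup")
           for s in synthesize(name, 200)]
print(len(dataset))   # 600
```

In a real pipeline the "label" would come from simulating the grasp physics rather than a rule on scale, but the structure (randomize, simulate, label, aggregate across a class) is the same.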

LatePost: Many people believe that synthetic data from a simulator (a system that provides a simulated virtual environment) inherently differs from real-world data, which hurts training. How do you solve this?

Wang He: A simulator can never be completely real, but the Sim2Real route does not require it to be. It is a process of jointly optimizing hardware, algorithms, and simulation.

At this stage, the simulator is a verification tool; the mathematical and physical model expressed by the algorithm is the core of obtaining pose information.

Simulators do have limits. For example, when a hand touches a mineral-water bottle, a flexible, deformable hand is touching an object that looks rigid but actually deforms; the contact is not point contact but friction over an area, and physics has not modeled this perfectly.

So our algorithm needs strong adaptive ability: adding touch and force control, learning the "shape", grasping and then adjusting, so that the hardest parts of the simulation can be sidestepped. Another prerequisite is that the hardware must be sufficiently robust (able to run stably under abnormal conditions).

LatePost: How do the simulator and the mathematical and physical models work together?

Wang He: We propose a set of mathematical and physical models to describe how to search for targets efficiently, then use the simulator to verify whether the approach is feasible.

This also involves the difference between reinforcement learning and supervised learning. Reinforcement learning means repeatedly interacting with the simulator, trial and error, to find a solution, which places considerable demands on the simulator's realism. Legged walking is entirely simulator-based reinforcement-learning Sim2Real. But trial and error is relatively inefficient.

If you can tell the robot how to grasp, you can turn it into supervised learning, which is more efficient. We use supervised learning for both two-finger and five-finger grasping.

Consider commercialization from day one

LatePost: Most Chinese companies making humanoid robots also make other products: Zhiyuan has commercial cleaning robots, and Zhuji and Unitree both make robot dogs. In the United States, more companies launch humanoid robots directly. Why the difference?

Wang He: The abundance of capital differs between China and the United States. In the United States, since the money is there, they do everything at once; companies like Figure AI and Tesla go straight to full humanoids. But Figure AI is now valued at $2.5 billion, and the operations shown in its demos have nothing to do with locomotion. The bubble in the United States means they do not have to think in terms of strict PMF (product-market fit).

In August this year, Figure AI released a new robot, Figure 02, which can already perform some assembly demos in BMW's car factory.

LatePost: So you think the more correct approach is to think about productization from the very beginning? Isn't that too hasty in a frontier field like humanoid robots?

Wang He: On one hand, it is still a data problem. Embodied intelligence is bound to hardware, so without robots deployed in scenarios it is hard to obtain data at scale. And they cannot be deployed for free in large numbers, because the body costs too much to build. Large models do not need commercialization to obtain data, because the cost of distributing them is far lower than that of robots.

At the same time, robots are still under-polished. Without long-term observation of robots in real scenarios, you cannot iterate them to a state where they work stably. This is also why there are no PPT companies (companies that exist only in slide decks) on the robot track.

LatePost: What paths to productization do you see?

Wang He: The first step is a single operation on multiple objects in a single environment, such as moving different objects within the same factory or production line. This is what Google's RT-1 and Tesla's Optimus do today, though Optimus handles fewer objects. Neither is truly generalized, that is, universal, and neither can really make money.

The next step is the same operation on different objects across different scenarios within one industry. In industrial manufacturing, that means expanding from picking parts in one car factory to picking all parts in any factory; in retail, from stocking shelves in small supermarkets to stocking them in Walmart. When one training run covers different scenarios across an industry, the value is great.

The next step is to handle more tasks and more scenarios across all industries and move towards universality.

LatePost: The entire industry is now at the first step. How do you choose the first scenario, or first batch of scenarios?

Wang He: In any industry, wherever there is flexible production that is not yet fully automated, embodied intelligent robots are likely to land. Especially in manufacturing there are inconspicuous operations where demand may be strong and the required technology not that complicated.

We proceed one by one, from easy to hard, from high labor cost to low, from strong demand to weak.

LatePost: Does picking up medicine at a pharmacy fit the logic you described? Or did you create this scenario because Meituan invested in you?

Wang He: We want to be the first to seize high-profit, high-value scenarios that can be transformed into greater versatility. Our future goal is to enter households.

A B-to-B-to-C setting is closer to the home than pure B-to-B, so we set up a B-to-B-to-C scenario in retail, one that deals directly with people.

LatePost: When will your first robot be released?

Wang He: We will take small-batch orders in Q4 this year, priced at 500,000 yuan.

LatePost: Isn't that too expensive for picking medicines at a pharmacy?

Wang He: We now have two main sales directions, scientific research scenarios and commercial scenarios like Meituan. The prices and configurations of these scenarios are different.

For research scenarios we sell a development version with ample onboard compute. For commercial scenarios the unit does not support development; it adds certain functions and cuts unneeded functions and compute. For example, the robots currently carry Orin X boards, but in commercial scenarios the compute can move to the cloud.

We have already received dozens of orders for scientific research. In commercial scenarios, our team will be responsible for the entire process, from machines to services.

LatePost: You once said Galaxy General expects to bring the cost of a robot down to 50,000 yuan. When will that happen?

Wang He: Not this year, but as we reach thousands or tens of thousands of units we will keep closing in on that goal.

LatePost: There is a joke that humanoid-robot sales in China are propped up by peers: startups and university laboratories.

Wang He: The ceiling for research sales is certainly low, but research is the first step. A company founded a year ago cannot possibly sell a thousand robots, unless they are toys.

LatePost: We have talked a lot about the current non-consensus in the embodied-intelligence industry. What do you think the consensus is?

Wang He: So far, no embodied-intelligence scenario has generated economic benefits at scale. There is no consensus on how to make money, so there is no consensus on product form, technology, industry, or scenario.

A lack of consensus is a good thing. If everyone agreed, the final battle would be about cost, resources, and connections, which are not what entrepreneurs are good at and are bad for entrepreneurship.

But if we imagine the future, the end state of the technology, entering the home plus a full human form plus large models, I think everyone can agree on that.

LatePost: How would you describe the journey of the many new companies pursuing embodied AGI?

Wang He: It is the process of humans playing creator once again. The automobile industry was also an industry created entirely by humans, and general-purpose robots will be the same. Among us, too, there will be leading companies like Tesla.