
The robot backed by OpenAI looks so human-like that investors are amazed: "I thought there was a real person under the clothes"

2024-09-05


Class Representative series: the fastest and most comprehensive interpretation of major AI events. This article focuses on NEO, the humanoid robot recently released by OpenAI-backed 1X Technologies, and uses it as a thread to examine the distinctive technology path and positioning choices the company has made.

AI Future Guide | Authors: Hao Boyang, Zhou Xiaoyan

Editor: Zheng Kejun

After visiting the World Robot Conference, some investors told Tencent Technology that they had begun to regain confidence only after seeing NEO, the robot produced by 1X, a humanoid robot company backed by OpenAI.

Even Wang Yuquan, founder of Haiyin Capital, who has long opposed building robots in human form, was surprised. He told Tencent Technology: "NEO's movements are so natural and coordinated that they completely break people's stereotypes about robots. When I first saw the 1X robot, my first reaction was to suspect there was a real person under the clothes."

The robot backed by OpenAI has started doing housework, and it is so lifelike that people suspect a human is inside the shell.

We were also amazed by its smoothness. But beyond that, what we most want to know is why it chose the "bipedal" mode for home scenarios, a world dominated by the "wheeled" mode.

In the report above, we noted that more than 80% of robots serving industrial scenarios adopt the "bipedal" mode in the design of the lower body. Tasks in the home, however, are less standardized, more trivial, and interrupted more often, which requires home robots to be safe and quiet. Compared with bipedal designs, which suffer from high cost, immature control algorithms (leading to unstable walking and standing), and loud noise, wheeled locomotion on flat floors is quieter and more stable.

NEO takes a different approach: it is a bipedal robot, something rarely seen in home scenarios.

In the demonstration video, NEO is remarkably "soft". If it weren't for the tether hanging behind it, it would look like a real human collecting wine glasses in the kitchen.

It can anticipate the next step of a human doing housework without any instructions, relying solely on its own "observation".

NEO moves quietly, but if you turn up the volume in the demo video, you can still hear a subtle hum when it bends down to pick up a backpack.

Unlike many humanoid robots that look tall and imposing, NEO looks like the boy next door, dropping by in casual clothes to help with the housework.

NEO is 1.65 meters tall, has 55 degrees of freedom across its body, and weighs only 30 kilograms, roughly a third to half lighter than most humanoid robots of the same height. Yet NEO is strong: according to Medium, it can carry 20 kilograms, and its grip is strong enough to lift 70 kilograms (154 pounds).

(Figure: weight comparison of adult-height humanoid robots at home and abroad)

Judging from these parameters, NEO is small, but its strength is not inferior to the industry's mainstream humanoid robots. Among bipedal humanoid robots, only NEO is explicitly positioned to serve home scenarios; the others basically serve industrial ones.

So how does NEO manage to walk around the house so quietly? How does it anticipate human actions just by observing? Has it cracked the generalization problem of humanoid robots?

Where did the design split between wheels and legs come from?

Bipedal humanoid robots suit industrial scenarios, but they face many challenges once moved into the home.

The core challenge is that the bipedal mechanical structure is complex: more joints must be driven to keep the robot running, which inevitably demands higher power. To be usable at home, a whole chain of problems caused by high power draw, such as heat dissipation and noise, must be solved.

By contrast, in industrial scenarios robots usually work in warehouses or enclosed factories, which are often equipped with refrigeration or cooling systems, so bipedal robots need not worry much about overheating there.

as "workers", the requirements for their appearance are not high. they can be half-naked (with exposed parts) or even walk around the factory with wires hanging. the lack of "clothes" also helps to dissipate heat. for example, boston dynamics' hydraulic atlas can run back and forth "ferociously".

(Photo: Boston Dynamics' hydraulic Atlas)

Besides, the industrial environment is already full of mechanical noise, so the sound of a bipedal robot's joints and footsteps is far less noticeable.

But once moved into the home, these problems that go unnoticed in a factory all become bugs: poor heat dissipation could cause a fire, excessive noise wears on the nerves, and exposed components pose a serious safety hazard, especially for families with children.

Wheeled designs consume little power and thus naturally sidestep troubles such as heat and noise.

This means that to bring bipedal robots into the home, they must be optimized and redesigned from the ground up.

Eric Jang, 1X's vice president of AI, offered a solution in NEO's design: optimizing the robot's core component, the motor. Unlike many humanoid robots built on the idea of "small motor, large gear ratio, high kinetic energy", he said in a recent interview, NEO's key formula is "high torque, small gear ratio, low kinetic energy".

So how should we understand Eric Jang's words? Let us first briefly review the relationship between a humanoid robot's "motor" and its "gear ratio".

By analogy with humans, humanoid robots really have only two types of motion: linear and rotational. For example, in the 1X demonstration video, NEO "waves" at humans for a few seconds. Dissected, this action consists of first extending the right hand (linear motion), then waving it (rotational motion).

Keep decomposing and you will find that a humanoid robot's entire movement system is a combination of these two motion types.

Linear motion is achieved by a "motor + lead screw" combination, while rotational motion is achieved by "motor + reducer". Here we focus on rotational motion: the "motor + reducer" pairing drives the robot's "joints", and compared with wheeled designs, the extra movement a bipedal robot performs is concentrated precisely in those joints.

the core of the "gear ratio" affects the speed of rotational motion, which is the combined speed of "motor + reducer".

In simple terms, the gear ratio is the ratio between the motor's output speed and the speed at which the joint actually executes. For example, if a humanoid robot's leg must move at speed v, a high gear ratio means the motor must spin fast, while a low gear ratio means the motor can spin slowly.

Many humanoid robots use high gear ratios (for example, 10:1): the gearing divides the motor's speed down sharply, so the joint moves more slowly but with multiplied force. This configuration suits situations that demand large forces but not high-speed movement.

With a low gear ratio (for example, 3:1), the motor's speed is reduced only slightly and the joint moves faster. This configuration suits situations that demand quick response and agile operation.

NEO reduces the power consumption of its core joints by choosing a low gear ratio and lowering the motor's output speed.
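The speed-for-torque trade described above is simple arithmetic at the joint: torque is multiplied by the gear ratio while speed is divided by it. A minimal sketch in Python (the motor figures below are made-up illustration values, not NEO's actual specs):

```python
# Joint output for a given motor and gear ratio: torque scales up by the
# ratio (times gearbox efficiency), speed scales down by the same ratio.
def joint_output(motor_torque_nm, motor_speed_rpm, gear_ratio, efficiency=0.9):
    joint_torque = motor_torque_nm * gear_ratio * efficiency
    joint_speed = motor_speed_rpm / gear_ratio
    return joint_torque, joint_speed

# "Small motor, large gear ratio": a 2 Nm motor at 6000 rpm through 10:1 gearing.
print(joint_output(2.0, 6000, 10))  # (18.0 Nm, 600 rpm) - strong, but fast noisy gearing

# "High torque, small gear ratio": a 6 Nm motor at 2000 rpm through 3:1 gearing.
print(joint_output(6.0, 2000, 3))   # (16.2 Nm, ~667 rpm) - similar torque, quieter motor
```

The second configuration reaches roughly the same joint torque while the motor spins three times slower, which is the quieter, lower-kinetic-energy route the article attributes to NEO.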

A low gear ratio means giving up the gearbox's speed-for-torque multiplication. In his technical post "Motor Physics", Eric Jang explains that NEO compensates with a "high torque" motor for the force a slower-geared drivetrain would otherwise lose. As he puts it, "Most motors do not have enough power to exert a lot of torque, so mechanical engineers use high-speed motors and add gears to them to exchange speed for torque."

(Photo: screenshot from Eric Jang's technical post "Motor Physics", explaining how mechanical engineers trade motor speed for torque)

This explains why many bipedal robots can only be used in industrial settings: "Most humanoid robotics companies choose to deploy their robots in factories rather than homes because they rely on rigid, highly geared drive systems. These systems are not safe around people and must be enclosed in cages."

Seen this way, the 1X team has found a hardware path that lets a bipedal robot operate safely in the home, which is why NEO can wear human clothes without worrying that poor heat dissipation will set them smoldering.

In fact, 1X's previous-generation robot, EVE, was wheeled; only with the NEO generation did the company go bipedal. The fundamental reason is, again, scene adaptation.

Household scenes are complicated: the robot must reach under tables and pick things up off counters. A robot on a wheeled chassis has to "stretch" its arms into some corners of the home because the base takes up space. Eric Jang argues that "in this case, the robot should shift its own center of gravity to get at things, the way humans do." When something has fallen into the corner of a cupboard, for example, the robot should be able to lift one leg and brace one hand on the table like a person, using the shift in its center of gravity to reach the object.

Eric Jang gave another example in the interview: why do many bookshelves leave a recess at the bottom? "It is to make it easier for people to put their toes in," so that they can lean into the shelf to reach a book.

Two feet, therefore, shrink the robot's movement footprint, while a wheeled base cannot adapt to the home's fiddly scenarios.

That is the logic behind 1X's move from wheels to legs. In home scenarios, wheeled designs may indeed be inferior to legged ones. Beyond that, NEO also has some "secret" formulas for generalization and data collection.

Are generalizable robots already on the threshold?

For a household robot, the most important thing after safety is to be a genuinely versatile helper. That requires the robot to be "smart": able to understand its owner's needs, operate autonomously, and generalize.

Looking across the robotics companies OpenAI has invested in, the common feature of their products is that they are very "smart", that is, they combine large models with robots very well.

Figure 01's amazing performance, for instance, owes much to its ability to understand commands, recognize objects, and make judgments, the result of pairing multimodal large models with robots.

Another portfolio company, Physical Intelligence, so far has only a website and no product. But in an interview the company said its vision is to "build a general AI model that can be widely used in a variety of scenarios, rather than powering robots that perform repetitive tasks in warehouses or factories."

As for the mechanical side, it has even announced that it will not build hardware itself but will buy robots of many types to train its software on.

(Picture: Physical Intelligence)

That makes it more of an AI-model company than a robot company.

And 1X's robot is no exception.

Eric Jang, 1X's vice president of AI, has deep experience integrating large models into robots. Before joining 1X in 2022, he led a team at Google DeepMind on the SayCan project, one of the earliest attempts to integrate language models with embodied intelligence in robots.

In February of this year, 1X released a video of EVE performing tasks driven entirely by neural networks, which drew some attention. From a GRASP SFI talk in April 2024, we can see the overall operating logic of this model.

This model, too, is a pipeline (a workflow). First, a DiT (Diffusion Transformer) model, conditioned on natural-language commands, uses diffusion to generate a predicted image of the robot's future position. Then this prediction, the current image, and the target are fed into a second Transformer model to predict the mechanical actions required next.
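As a rough illustration of that two-stage pipeline, here is a minimal Python sketch. All names, shapes, and stub bodies are hypothetical stand-ins for exposition, not 1X's actual code:

```python
import numpy as np

# Stage 1 (hypothetical stand-in): a DiT conditioned on a natural-language
# command denoises toward an image of where the robot and scene should be a
# moment from now. Stubbed as an identity function for illustration.
def predict_future_frame(current_image: np.ndarray, command: str) -> np.ndarray:
    return current_image  # real model: iterative diffusion denoising

# Stage 2 (hypothetical stand-in): a second Transformer takes the current
# image, the predicted future image, and the target, and outputs the motor
# actions needed to get there. Stubbed to return zero targets.
def predict_actions(current_image, future_image, goal: str) -> np.ndarray:
    return np.zeros(55)  # e.g., one target per degree of freedom

# The pipeline chains the two models at every control step.
frame = np.zeros((224, 224, 3))
future = predict_future_frame(frame, "put the wine glasses on the tray")
actions = predict_actions(frame, future, goal="wine glasses on the tray")
```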

The video shows EVE sorting items, carrying them, and even plugging itself in to charge (no wonder it's called EVE), with some tasks done two-handed. But look closely and EVE's capabilities at the time were limited to identifying, grabbing, and placing items, with these basics composed into specific tasks such as packing, carrying, and sorting.

By August or September of this year, basically every robot company on the large-model track could achieve these capabilities.

Figure, for example, released a video at the end of February showing its robot using a large model to make coffee, even correcting its own mistakes along the way.

(Figure 01 brewing coffee in the demonstration video)

After this, however, Figure and 1X took different paths on models.

In March, Figure chose to plug in GPT-4o directly, giving its robots strong conversational and reasoning abilities, and used a pipeline to integrate three models.

The GPT-4o large model first parses the language and plans the action. Then Figure's own neural-policy layer, an end-to-end task model it trained itself, executes the action. Meanwhile, its own whole-body control model keeps the robot balanced.
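Conceptually that is a three-layer stack, with a slow language planner on top, a learned policy in the middle, and a fast balance controller at the bottom. A hedged sketch (layer names, skill strings, and the 55-joint target vector are illustrative assumptions, not Figure's published interfaces):

```python
# Hypothetical three-layer control stack.
def language_planner(speech: str) -> list[str]:
    # VLM level: turn a spoken request into a short list of skills to run.
    return ["locate cup", "grasp cup", "hand over cup"]

def neural_policy(skill: str, camera_frame) -> list[float]:
    # Policy level: an end-to-end model maps pixels to joint targets for one skill.
    return [0.0] * 55

def whole_body_controller(joint_targets: list[float]) -> None:
    # Control level: tracks the targets while keeping the robot balanced.
    pass

for skill in language_planner("Could you hand me that cup?"):
    targets = neural_policy(skill, camera_frame=None)
    whole_body_controller(targets)  # in reality the inner loops run far faster
```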

(Figure's official explanation of its model composition)

After interaction became the highlight of its robots, Figure 02 went on to emphasize the "brain-level" gains from tripling its on-board compute; on the model side, tighter integration of OpenAI's models became the development focus.

1X, by contrast, did not release its language-command update until May 31. In the demo video, the robot can finally understand a task through voice and perform the corresponding operation. But even now, 1X has not wired in a large high-level language model. A document on its official website notes: "After building a dataset of visual to natural language command pairs, the next step is to use visual language models such as GPT-4o, VILA, and Gemini Vision to automatically predict high-level actions." As a result, its robots still lack the ability to plan complex tasks.

It looks as if 1X is a big step behind on intelligence.

But that may be because the two are pushing in different directions. Compared with interaction and planning, 1X cares more about task generalization.

In its official blog in March, 1X explained the model it is building: a "base model" trained to understand a broad range of physical behaviors, from cleaning and tidying the home to picking up objects and interacting socially with humans and other robots. On top of it, they graft more specific skill families as they accumulate skill-training data (for example, one model for general door operation and another for warehouse tasks). In other words, they are trying to build a robot "base model" that supports multi-task generalization.

This is generalization of task capability: a single robot performing many tasks with a single model. That alone is not unusual, since nearly every robot-software company trains many single tasks. Yet across all the demo videos and conference exhibits, we have rarely seen one robot carry a complex task through continuously, say, cleaning the whole room and then cooking.

This is because no current model can generalize across tasks.

As Eric Jang told The Robot Report, "We have previously demonstrated that our robots can pick up and manipulate simple objects, but a truly practical household robot must be able to chain multiple tasks together smoothly." And that is not simply a matter of having a high-level "brain" model split a complex task into subtasks, because each subtask begins from a different position and condition.

Each task must compensate for the one before it. If the first task fails to bring the robot to quite the right spot beside the table, the second task has to stretch the arm further to grab the object, and the third must compensate further still. Errors accumulate.

1X's solution is to split the model. Today it consists of two parts: a base model that understands all the tasks and "task chains", plus many small models that understand specific tasks better. This, too, amounts to a pipeline.

They built a natural-language interface through which employees use voice to steer the robot through combinations of the small models' actions and to intervene when something goes wrong. That chains the models into a longer-horizon "task chain". The interventions, together with the data from the whole multi-task run, are then used to train the big "base model". Ultimately, the accumulated task data and "task chain" data tune the base model so it can handle both executing a single task and stitching tasks together.
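A minimal sketch of that idea, with hypothetical names throughout (run_skill, ask_operator, and the episode log are illustrative, not 1X's interface); the skill list echoes the documentary demo described later:

```python
# Chain small per-skill models in sequence; a human operator can intervene by
# voice, and everything (including corrections) is logged to retrain the base model.
episode_log = []

def run_skill(skill: str, world_state: dict) -> dict:
    # Hypothetical: a small task-specific model drives the robot and returns
    # the (possibly imperfect) state it actually ends in.
    return world_state

def ask_operator(skill: str, state: dict) -> dict | None:
    # Hypothetical: an operator watching the run can issue a voice correction.
    return None  # None means "looks fine, carry on"

state = {"robot_at": "hallway"}
for skill in ["open door", "enter bathroom", "close toilet lid", "walk out"]:
    state = run_skill(skill, state)   # each skill starts from the previous end state
    correction = ask_operator(skill, state)
    if correction is not None:
        state = correction            # human fixes accumulated error mid-chain
    episode_log.append((skill, state, correction))
# episode_log later becomes training data for the shared base model.
```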

(Figure: the natural-language control interface developed by 1X)

So unlike Figure's bet on heavy interaction and planning, the core problem 1X is attacking is generalization between tasks, which may be the real bottleneck on the way to truly general-purpose robots.

How, then, is 1X's inter-task generalization progressing?

In the latest documentary, a staff member uses voice to walk a robot through opening a door, entering the bathroom, closing the toilet lid, and walking out, step by step. The task is not issued all at once but command by command, chained together.

That may not look very "automatic", but it shows the 1X robot already has an initial ability to work continuously across multiple commanded tasks. With basic "task chain" execution in place, adding the planning ability of a frontier model such as GPT-4 should soon let it complete complex, continuous tasks autonomously.

Eric Jang seems to think so. In a March blog post titled "All Roads Lead to Robotics", he wrote: "Many AI researchers still believe that general-purpose robots are decades away. But remember, ChatGPT was born as if overnight. I think the field of robotics will also usher in such a change."

In his eyes, generalizable, general-purpose robots seem within sight.

But the industry's pessimism is not unfounded. The main worry is not algorithms but data: embodied intelligence lacks sufficient data, the data is hard to collect, and there are no standards for it.

Yet under scaling laws, a large volume of data is the key to generalization. Compared with a pure language model, embodied intelligence may need even more data to generalize, since it spans both images and actions, and collecting that data takes enormous time.

Collecting "smart" data by "dumb" methods

In the documentary, Eric Jang made a claim that runs against the industry's general worry: "A lot of people overestimate the bottleneck of data collection. In practice, data may become less and less important in the next 12 months."

His confidence comes from past practice: 1X's data-collection logic has always differed slightly from that of other robot companies.

Other companies generally use every available means to gather as much data as possible, from running simulated robots in simulated physics environments such as Unreal Engine 5 to mining videos of humans manipulating objects for information.

The mainstream method today, however, is teleoperation (training from demonstration): humans wear VR rigs and demonstrate tasks to the robot.

Such teleoperated collection usually places the robot in a fixed "data collection factory" so that enough data can be gathered as efficiently as possible, even at the cost of duplication and similarity.

(Photo: Tesla's data collection factory)

By Eric Jang's account, 1X's method is deliberately "stupid". Instead of the seemingly efficient centralized collection Tesla uses, 1X insists on collecting across a wide variety of real-life scenes. That is why we see it gathering data in many very different spaces, not in a factory. It also forgoes video training and simulation data, sticking to teleoperation alone.

(Picture: EVE's training scenes are strikingly diverse)

CEO Bernt Bornich put it this way in an interview: "Diversity is the most important aspect of humanoid robot data. Learning from the diversity in the unstructured environment of consumer robots will make truly intelligent general-purpose robots possible. Intelligence comes from the diversity of thought."

In 1X's view, the homes and offices where robots will ultimately land have no fixed structure and keep changing with human use, so only sufficiently diverse data is meaningful. Hence Eric Jang's data-collection formula for 1X: "diversity > quality > quantity > algorithm".

To achieve that diversity, 1X assembled a dedicated team of robot operators, all carefully selected, who can personally train some behavior models through a simple no-code graphical language interface. As Eric Jang wrote in a technical blog post: "1X is the first company I know of that allows data collectors to train robot capabilities themselves. This greatly shortens the time required for the model to reach a good state, because data collectors can quickly get feedback on the quality of their data and how much data is actually needed to solve the robot task. I foresee that this will become a common mode of robot data collection in the future."

So they have not just data collectors but collectors who can fine-tune models directly: they spot what fails in a specific task, collect data for those scenarios, retrain and adjust the model, and repeat until it works. Data collection and training are one loop.
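That collect-and-retrain loop is easy to picture in code. A hedged sketch (every function and data structure here is a hypothetical placeholder for 1X's internal tooling):

```python
# Hypothetical loop a single data collector runs: spot what fails, gather
# targeted demonstrations for exactly those scenarios, retrain, repeat.
def find_failures(model: dict) -> list[str]:
    # Placeholder: in practice the collector watches the robot and notes failures.
    return model["open_failures"]

def collect_demonstrations(scenarios: list[str]) -> list[dict]:
    # Placeholder: teleoperated demos recorded for the failing scenarios.
    return [{"scenario": s, "frames": [], "actions": []} for s in scenarios]

def retrain(model: dict, demos: list[dict]) -> dict:
    # Placeholder: fine-tuning on targeted demos clears the demonstrated failures.
    fixed = {d["scenario"] for d in demos}
    model["open_failures"] = [f for f in model["open_failures"] if f not in fixed]
    return model

model = {"open_failures": ["grasp slips on glossy cups", "misses cupboard corner"]}
while model["open_failures"]:  # iterate until the task is reliable
    demos = collect_demonstrations(find_failures(model))
    model = retrain(model, demos)
```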

(Photo: per 1X's LinkedIn, all of these operators are hired as regular employees rather than outsourced, at a monthly salary of $6,000-8,000, roughly 1.5 times the US average)

these "dumb" methods ensure the quality and diversity of the collected data, and each data is as "useful" as possible. in an interview in recent days, rric said, "if you deploy robots in a factory, where they repeatedly perform exactly the same tasks, these data are basically useless."

Such painstaking collection inevitably slows the growth of data volume, but its effect is striking.

(Top: hours of data collected by 1X; bottom: diversity of actions collected by 1X)

According to Eric Jang's technical talk, as of March 2024 they had collected 1,400 hours of training data covering 7,000 distinct actions, and on that data EVE robots already have hundreds of independent capabilities.

By contrast, RT-2 was trained on 130,000 examples that took 13 robots 17 months to collect. If each example averages five minutes, the total comes to over 10,000 hours (130,000 × 5 min ≈ 10,800 hours). It can perform tasks spanning 700 different instructions.

Seen this way, refined data collection really does pay: with roughly a tenth of the data, 1X reaches at least half the capability level. "More haste, less speed" applies to robotics too.

Conclusion

Overall, 1X's biggest "killer feature" is its focus on people.

The corporate culture 1X projects has an air of "ease". Whether with the earlier EVE or the recent NEO, its promotional videos are nothing like Figure's cold, high-tech ones. 1X keeps a low profile and does not chase viral spread, which can also be read as a kind of idealism.

In the NEO promo video, 1X is crafting the image of a "warm guy", the boy next door: he wears fitted casual clothes that show muscle lines like a human man's, gently looks after the family's daily life, packs your bag before you go out, and gives you a warm hug when you leave.

The demo video also shows NEO understanding human gestures, a deeper grasp of how people communicate. Much human communication is nonverbal, and people are sometimes "lost for words"; that NEO can "read" a person's next move and respond without language feels especially "human".

Judged by task generalization and flexible design, NEO can be called the first bipedal humanoid robot for the home.

If robots can one day be lasting companions, what kind of robot do we want beside ourselves and our descendants? Perhaps NEO's answer is a good option.