Wang Xiaogang of SenseTime Jueying: Even if the two-stage end-to-end model is used for another ten years, it will not become the "ChatGPT" of intelligent driving

2024-07-15



Wang Xiaogang, co-founder and chief scientist of SenseTime and president of Jueying Intelligent Vehicle Business Group

At the just-concluded WAIC 2024, SenseTime Jueying released a video shot in a single continuous take.

In the video, a UniAD-equipped vehicle relying on only seven cameras, and without any map, not only moves freely through urban construction zones, large intersections, and signalized junctions, but also handles complex rural traffic: passing through unmarked asymmetric intersections, avoiding vehicles parked on the roadside and oncoming vehicles on narrow roads, and turning right on sharply curved roads without lane lines.

This series of smooth driving maneuvers is impressive, and it is powered by UniAD, the industry's first end-to-end autonomous driving solution to integrate perception and decision-making.

For the past few years, intelligent driving has been a constant focus for automakers, yet actual driving performance has often been unsatisfactory. Since ChatGPT appeared, the intelligent driving industry has been hoping for a qualitative leap of its own.

At this point, "end-to-end" points the way. Since the beginning of this year, the intelligent driving industry has paid more and more attention to end-to-end. Automakers such as Xpeng, Li Auto, NIO, and Great Wall, as well as technology suppliers such as Huawei, DeepRoute.ai, and Haomo.AI, have all turned to the end-to-end route.

As early as the end of 2022, SenseTime proposed UniAD, a general autonomous driving model integrating perception and decision-making. DriveAGI is an iteration built on UniAD: it uses a multimodal large model to support end-to-end solutions and create the next generation of autonomous driving technology. Even when encountering an ambulance on the road, a vehicle with DriveAGI's cognitive abilities can accurately identify and understand the target and proactively give way.


DriveAGI can not only identify ambulances but also proactively give way to those on duty.

After two years of groundwork, SenseTime Jueying's advantages of early entry and fast iteration are gradually showing: it has cooperated with more than 30 domestic and foreign automakers, covering more than 90 vehicle models, and has delivered a cumulative 1.95 million smart cars. Through these collaborations, SenseTime Jueying and the automakers have found their respective boundaries, played to their respective strengths, and are accelerating the arrival of autonomous driving's "GPT moment".

"If the technical route is wrong, even getting into cars will be futile."

At a time when many players are rushing toward end-to-end, Wang Xiaogang, SenseTime co-founder, chief scientist, and president of the Jueying Intelligent Vehicle Business Group, recalled to Titanium Media App why SenseTime was among the first to commit to the end-to-end route.

In 2017, SenseTime and Japan's Honda Motor announced a partnership to jointly develop L4 autonomous driving technology. SenseTime itself had started from AI vision technology, and Honda required it to deliver intelligent driving functions using only cameras and no high-definition maps, which can be seen as a prototype of end-to-end. The team has continued researching end-to-end ever since.

Now, although the end-to-end race is in full swing, a common problem is that no best practice has yet formed, and technical routes still diverge.

Wang Xiaogang told Titanium Media App that most current end-to-end solutions take the "two-stage" approach, which is easier to implement and consists of two models, perception and decision-making. "The first stage, perception, already uses neural networks, so there is not much change. The biggest change is in the second stage, planning and control. This part was originally implemented with hand-written rules, but now it, too, is done with neural networks."

In his view, however, the "two-stage" solution merely connects two small models and optimizes them jointly. After the information is filtered by the perception model, much of it is lost; only labels such as people, cars, and objects remain, so the second-stage model is in effect just a small model. "The core difference between the two-stage and one-stage solutions is whether you are in the era of small models or the era of large models."

Wang Xiaogang bluntly stated that even if the "two-stage" solution is used for another 10 years, it will not become the "ChatGPT" of autonomous driving.

With these issues in mind, SenseTime Jueying integrated perception, decision-making, planning, and other modules into a single full-stack Transformer end-to-end model from the very start of development, realizing a "one-stage" solution with integrated perception and decision-making: sensor input goes in, and a driving trajectory comes out.
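To make the contrast concrete, here is a minimal, illustrative sketch in PyTorch. It is not SenseTime's actual architecture; every module, shape, and size below is an assumption, chosen only to show how the two-stage hand-off compresses rich video features into a few labels, while the one-stage model keeps a single differentiable path from pixels to trajectory.

```python
# Illustrative sketch only (not SenseTime's architecture): contrasting a
# "two-stage" pipeline with a "one-stage" end-to-end model. All module
# names, shapes, and sizes are assumptions.
import torch
import torch.nn as nn

class TwoStagePipeline(nn.Module):
    """Perception model -> sparse labels -> small planning model.
    The hand-off compresses rich video features into a handful of object
    labels/boxes, which is the information bottleneck the article describes."""
    def __init__(self, num_classes=3, max_objects=32):
        super().__init__()
        self.perception = nn.Sequential(     # stands in for a detector
            nn.Conv2d(3, 64, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, max_objects * (num_classes + 4)),  # class + box
        )
        # The planner only ever sees the compressed labels, so it stays small.
        self.planner = nn.Sequential(
            nn.Linear(max_objects * (num_classes + 4), 128), nn.ReLU(),
            nn.Linear(128, 10 * 2),          # 10 future (x, y) waypoints
        )

    def forward(self, frames):
        labels = self.perception(frames)     # lossy hand-off between stages
        return self.planner(labels).view(-1, 10, 2)

class OneStageEndToEnd(nn.Module):
    """A single Transformer maps camera tokens directly to a trajectory;
    there is no intermediate label interface, so information and gradients
    flow end to end."""
    def __init__(self, d_model=256):
        super().__init__()
        self.tokenize = nn.Conv2d(3, d_model, 16, stride=16)  # patch embed
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.traj_head = nn.Linear(d_model, 10 * 2)

    def forward(self, frames):
        tokens = self.tokenize(frames).flatten(2).transpose(1, 2)  # B x N x D
        features = self.backbone(tokens).mean(dim=1)               # pool tokens
        return self.traj_head(features).view(-1, 10, 2)

frames = torch.randn(1, 3, 256, 256)         # one camera frame, illustrative
print(TwoStagePipeline()(frames).shape)      # torch.Size([1, 10, 2])
print(OneStageEndToEnd()(frames).shape)      # torch.Size([1, 10, 2])
```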

In this process, the machine synthesizes information and makes judgments the way a human brain does, much as a reader works through a mystery novel. The novel has assorted characters and plot lines, locked rooms, and unsolved puzzles; you never know what will happen next, yet from the characters and the plot you predict several candidates for the culprit. What the machine brain does is very similar.

However, although only one word separates the one-stage solution from the two-stage solution, the difficulty differs enormously. Wang Xiaogang explained that on the one-stage route, the volume of video information at the front end is huge, while the output signal must be very precise, which places higher demands on the training, data, and pipeline of the entire network.

"The 'one-stage' solution is difficult, but once the model is learned, its capabilities will be very strong. This is the 'ChatGPT' moment in autonomous driving that we are pursuing," said Wang Xiaogang.

"A pure end-to-end autonomous driving model is not the final answer to autonomous driving."

Choosing the technical route is only the first step. At the end of 2022, SenseTime and its joint laboratory proposed UniAD, the industry's first general autonomous driving model integrating perception and decision-making; the UniAD paper went on to win the Best Paper Award at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023 the following year.

At this year's Beijing Auto Show, SenseTime Jueying demonstrated UniAD's on-road results, driving freely on urban and rural roads. Then, at WAIC 2024, it showed a real-world demonstration of UniAD on complex urban and rural roads.

UniAD is a pure-vision end-to-end general model for autonomous driving. Although it improves the driving ability of the intelligent driving system, a pure end-to-end model is not the final answer to autonomous driving. Wang Xiaogang said that an important marker of smart cars evolving into super-intelligent agents is the further ability to perceive, reason, decide, and interact with the open world. SenseTime Jueying has therefore built the intelligent driving large model DriveAGI on top of a multimodal large model.

The evolutionary direction of DriveAGI is to make end-to-end intelligent driving "explainable and interactive".

Explainability means that the vehicle can not only understand the complex real world more the way a human does, discerning the behavioral motivations of various traffic participants, quickly learning all kinds of traffic rules, and grasping ever-changing road information, but can also explain the reasoning behind its driving decisions to the user.

For example, a vehicle driving normally in the right lane of a two-lane road, once equipped with DriveAGI, can immediately identify an ambulance approaching from behind, determine that it is on duty and must be given way, confirm that there is room to change lanes on the left, and move from the right lane to the left in time so the ambulance can pass quickly and smoothly. The whole process resembles the human brain: seeing clearly the different situations on the road, reasoning about them against the traffic rules, and taking the correct driving action.

Interactivity means that users can not only ask DriveAGI to explain its decision-making process but also steer the autonomous driving behavior with voice or gesture commands. For example, suppose that in autonomous driving mode the navigation calls for a U-turn at the next intersection, but the driver knows there is a shortcut and the car can simply turn left ahead. The driver need only say "turn left directly" to the system, and the system will execute the instruction according to the current road conditions.
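As a thought experiment only, the sketch below shows one way such a driver override might be gated by the perceived road state before it is executed. The function and field names are our assumptions, not DriveAGI's actual interface.

```python
# Illustrative sketch only: validating a spoken driver override against
# the perceived scene before letting it change the plan. All names here
# (SceneState, handle_voice_command) are hypothetical, not DriveAGI's API.
from dataclasses import dataclass

@dataclass
class SceneState:
    left_turn_ahead_exists: bool   # from map-free road-structure perception
    left_turn_permitted: bool      # from recognized signage and traffic rules
    left_lane_clear: bool          # from the perception stack

def handle_voice_command(command: str, scene: SceneState) -> str:
    """Execute a driver override only when the scene allows it."""
    if command == "turn left directly":
        if (scene.left_turn_ahead_exists
                and scene.left_turn_permitted
                and scene.left_lane_clear):
            return "accept: replan to turn left at the upcoming junction"
        return "reject: left turn unsafe or not permitted, keep route"
    return "ignore: unrecognized command, follow navigation"

print(handle_voice_command(
    "turn left directly",
    SceneState(left_turn_ahead_exists=True,
               left_turn_permitted=True,
               left_lane_clear=True)))
```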

The key to moving from black-box operation and one-way output to explainability and interactivity lies in how the model is trained.

The first element of model training is a large amount of data and large model parameters. Musk has previously talked about the importance of data to autonomous driving models: training with 1 million video cases is barely enough; 2 million is slightly better; 3 million is wow; 10 million is incredible.

Wang Xiaogang likewise said that the network structure is no longer a core secret; everyone's structures are fairly similar. The key is how to achieve excellent performance with a similar structure, and that mainly depends on whether the model is large enough and whether the data production pipeline is powerful.

After ten years of deep work in AI, SenseTime has deployed across many industries, including urban management, commerce, healthcare, finance, and autonomous driving, and even industrial scenarios such as steel, coal mining, and electric power, accumulating a large amount of multimodal data along the way. On July 5, at WAIC 2024, SenseTime Jueying demonstrated an in-vehicle deployment of its 8B (8-billion-parameter) model running on a 200 TOPS+ platform; a rough feasibility estimate follows the figure below.


Performance of SenseTime Jueying's in-vehicle 8B multimodal model
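A bit of back-of-envelope arithmetic suggests why an 8B model on a 200 TOPS-class chip is plausible. Every assumption below (INT8 weights, roughly two operations per parameter per token, 50% effective utilization) is ours, not a SenseTime figure.

```python
# Back-of-envelope feasibility check for an 8B-parameter model on a
# 200 TOPS in-vehicle chip. All assumptions are illustrative, not
# SenseTime's published numbers.
params = 8e9                      # 8B parameters
weight_gb_int8 = params / 1e9     # ~8 GB of weights at 1 byte each (INT8)
ops_per_token = 2 * params        # common transformer rule of thumb
chip_ops = 200e12                 # 200 TOPS peak
utilization = 0.5                 # assumed effective utilization
tokens_per_s = chip_ops * utilization / ops_per_token
print(f"weights: ~{weight_gb_int8:.0f} GB (INT8)")
print(f"compute-bound estimate: ~{tokens_per_s:.0f} tokens/s")
# In practice, memory bandwidth rather than compute often limits
# decoding speed, so treat this as an upper-bound sanity check.
```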

Quantity must be matched by quality. Wang Xiaogang said that one cannot focus only on the amount of data and the number of model parameters: without difficult tasks, the model's capabilities will stay flat no matter how much the data and parameters grow.

He then gave an example: bees can work in an intricately structured hive, and do so precisely and well, yet they have a single skill and can do only that one thing. The human brain is different; after thousands of years of evolution, humans can send satellites and rockets into space. "This is the difference between general abilities and specific abilities. A bee does one thing for its whole life, or two lifetimes, or three. A model is the same: if you only feed it data about people, cars, and objects, it will only ever be able to do that one thing."

Beyond data, a strong supply of computing power is today's scarcest and most contested factor.

SenseTime is one of the industry's few large-scale computing power suppliers. It began building computing infrastructure in 2018 and constructed the AIDC in Lingang, Shanghai, which runs 45,000 GPUs for large-model training and inference and can train models with hundreds of billions or even trillions of parameters. Relying on the AIDC, SenseTime's operational computing power has reached 12,000 petaFLOPS, and peak computing power is expected to reach 25,000 petaFLOPS by the fourth quarter of 2024.
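As a rough consistency check on those two reported figures (the averaging and the precision comment are our assumptions, not SenseTime's data):

```python
# Rough consistency check on the reported AIDC figures. The even split
# across devices and the precision remark are our own assumptions.
total_petaflops = 12_000          # reported operational computing power
num_gpus = 45_000                 # reported GPU count
per_gpu_tflops = total_petaflops * 1_000 / num_gpus
print(f"~{per_gpu_tflops:.0f} TFLOPS per GPU on average")
# ~267 TFLOPS per device, a plausible figure for modern accelerators
# running at reduced precision (e.g. FP16/BF16).
```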

"White-box delivery is not ruled out; only when the grass and trees flourish can the ecosystem be win-win."

No matter how good the technology is, the key still lies in its implementation.

Wang Xiaogang said that SenseTime Jueying's mass-produced intelligent driving products have landed in multiple brands and models, such as the GAC Aion LX Plus, Hozon Neta S, GAC Hyper GT, and Hongqi models, and functions such as highway NOA have begun to land as well. Jueying is also pushing deliveries for more models. In early June, GAC and FAW were selected for China's first batch of L3 pilot programs, with SenseTime Jueying providing their L3-oriented perception algorithms. Moreover, several of SenseTime Jueying's current mass-produced intelligent driving solutions can be upgraded to end-to-end architectures in the future.

Despite their many customers and orders, technology suppliers such as SenseTime Jueying must face a problem: automakers developing in-house.

Take Tesla as an example. Its distinguishing trait is that it not only does AI and owns massive infrastructure, including tens of thousands of GPUs, but also produces millions of cars every year, giving it end users' information and data and a closed loop of its own.

Will other automakers follow suit? And can they? Wang Xiaogang noted that even a company as powerful and talent-rich as Microsoft chose to cut back its own AI team and cooperate with OpenAI instead.

At the same time, he explained that "self-development" does not mean doing everything oneself from beginning to end; the key is controllability. "As long as automaker customers understand and control everything that happens, and can iterate the product on their own platforms, that is enough."

Accordingly, in past cooperation SenseTime tended to deliver code as a black box, regarding it as its most valuable asset. Now, Wang Xiaogang revealed, SenseTime does not rule out white-box delivery, because even when the code is handed over, deeper iteration and cooperation can quickly improve competitiveness.

Cooperation can also save automakers money. "We have invested more than 10 billion yuan in large models, building our own infrastructure, large-scale equipment, and profitable cloud services along the way, and have reached break-even. By cooperating with us, automakers do not have to bear this huge investment. They do not need to get involved in these areas themselves; we will open the relevant resources to them."

He admitted, however, that one problem in cooperating with automakers is the lack of data feedback. Feedback of terminal data usually depends on the automaker actively providing it, which can make data iteration and circulation inefficient. Deep cooperation with automaker customers is therefore especially important.

Through white-box delivery, SenseTime Jueying helps its automotive partners understand large-model technology and master the know-how. In turn, OEM partners can share data and information that involves no privacy or confidentiality issues with Jueying to train more powerful in-vehicle native large models. The two sides develop jointly, accelerating product iteration to create truly user-centric, natively AI large-model products for smart cars.

Backed by industry-leading computing power and world-leading "daily-update" large-model capabilities, and through a deeper strategic cooperation model, SenseTime aims to create win-win outcomes with OEMs and its many other partners.

SenseTime Jueying has set 2025 as the target for putting its end-to-end large model into use. Wang Xiaogang said that when ChatGPT came out, not everything was done perfectly; GPT-3.5, for example, handled many tasks poorly. But the key is that everyone saw the right direction. There is no problem following this path; it just takes a few more months of iteration. The same is true for end-to-end.

At the same time, he said confidently that when SenseTime Jueying's end-to-end solution enters mass production next year, users will see the system do things in some scenarios that were completely impossible before: newly emergent capabilities.

Wu Xinzhou, vice president of Nvidia's automotive business unit, has publicly said that end-to-end is the final chapter of the intelligent driving trilogy. On the road to that final chapter, SenseTime Jueying deserves particular attention and expectation.