Dialogue with Ideal Auto's Lang Xianpeng: We Are Already Ahead of Tesla

2024-08-31

Tesla CEO Elon Musk held a live broadcast in 2023 to demonstrate FSD V12, Tesla's intelligent driving software. The Tesla in the video was built on the latest end-to-end technology: the software removed a large amount of hand-written rule code and relied mainly on neural networks instead. Using only vision and neural networks, the vehicle worked out where to slow down, recognized traffic lights and all other road users, and made its own driving decisions.

The technology went on to spark wide discussion and interest among industry insiders, technology enthusiasts, and even ordinary car buyers.

It marks a new paradigm shift for intelligent driving. To this day, several views keep circulating in the Chinese market:

The arrival of end-to-end technology has reset the starting line for many automakers working on intelligent driving. Everyone is standing at the same starting point again, setting out on a long-distance race in data and computing power.

End-to-end technology depends heavily on data. How the model is built and how much data is collected, especially effective, high-quality data, determine how fast the technology can iterate.

In the Chinese market, domestic automakers with the "home advantage" have started another long-distance race in intelligent driving. At the same time, many people believe that the lead held by Tesla's FSD will be erased at this stage.

Dr. Lang Xianpeng, vice president of intelligent driving R&D at Ideal Auto (Li Auto), shares this view: in terms of technical architecture, Ideal's latest solution is not much different from Tesla's and may even be slightly ahead, because Ideal has a VLM and System 2, while Tesla only has the end-to-end System 1.

Ideal Auto's end-to-end model is a single, integrated "One Model", which differs somewhat from the end-to-end models of other automakers on the market.

Conventionally, end-to-end technology means using artificial intelligence models and machine learning to replace the perception, planning, and control modules of the intelligent driving pipeline. From the visual "input" end to the final "output" end, where the system controls the vehicle, everything is handled by the model. A pure end-to-end system no longer contains rule-based code anywhere in this process and becomes a complete black box.

However, many automakers on the market that claim end-to-end, including relatively advanced players such as Tesla, Huawei, and XPeng (Xiaopeng), still keep some underlying algorithms for safety redundancy. Perception, planning, and control may remain relatively independent modules, with interfaces between them that must still be defined and connected by hand.

The integrated end-to-end One Model puts perception, planning, and control into a single model, which Ideal calls System 1 internally. It acts like a driver who makes and executes decisions quickly.

In the past, end-to-end applications in intelligent driving often suffered from a high upper limit but a low lower limit. For example, industry leader Tesla ran very smoothly in tests in California and drove much like a human, but once the car entered an unfamiliar area it could make planning and control decisions that were hard to explain.

This is one of the drawbacks of end-to-end.

Ideal's approach is to go a step further and introduce System 2, a vision language model (VLM).

Ideal's VLM is the world's first large model successfully deployed on an in-vehicle chip, and it has the logical reasoning and decision-making capabilities needed to deal with complex scenarios.

Alongside the end-to-end One Model, System 2 (the VLM) is a second model that assists System 1 with planning and decision-making. The VLM-based System 2 contributes the ability to understand complex environments, read navigation maps, and understand traffic rules.

Lang Xianpeng offered a more accessible explanation of the combination: System 1 is like a driver, and System 2 is like a driving school instructor. System 1 acts entirely on its own visual perception, while System 2 draws on knowledge accumulated over a long period to remind and advise System 1.
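The driver-and-instructor split can be illustrated with a short Python sketch. It is only a toy under stated assumptions, not Ideal Auto's implementation: the names `System1`, `System2`, `Advice`, and `drive`, the idea of expressing advice as a speed cap, and the 10-frame update period are all hypothetical choices made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Advice:
    """Hypothetical hint from the slow system: a speed cap in m/s."""
    max_speed: float


class System1:
    """Fast path (assumed interface): runs on every frame."""

    def plan(self, frame: dict, advice: Optional[Advice]) -> float:
        # Stand-in for a learned model: the speed it would choose on its own.
        target = frame["free_road_speed"]
        if advice is not None:
            # Respect the instructor's hint when one is available.
            target = min(target, advice.max_speed)
        return target


class System2:
    """Slow path (assumed interface): reasons only every `period` frames."""

    def __init__(self, period: int = 10) -> None:
        self.period = period

    def maybe_advise(self, step: int, frame: dict) -> Optional[Advice]:
        if step % self.period != 0:
            return None  # too slow to run on every frame
        if frame.get("school_zone", False):
            return Advice(max_speed=8.0)  # assumed cap for the example
        return Advice(max_speed=float("inf"))


def drive(frames: list[dict]) -> list[float]:
    """Run both systems over a frame sequence and return the chosen speeds."""
    s1, s2 = System1(), System2(period=10)
    advice: Optional[Advice] = None
    speeds = []
    for step, frame in enumerate(frames):
        new_advice = s2.maybe_advise(step, frame)
        if new_advice is not None:
            advice = new_advice  # keep the latest advice until refreshed
        speeds.append(s1.plan(frame, advice))
    return speeds
```

The point of the sketch is the timing: the fast system produces an output on every frame, while the slow system only occasionally refreshes the advice that the fast system consults.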

Zhan Kun, a senior algorithm expert in intelligent driving at Ideal Auto, and his team first proposed this concept. They drew on the work of Daniel Kahneman, the psychologist and Nobel laureate, who held that human thinking runs on two such systems: the first relies on experience and intuition, and the second applies logical reasoning built up through learning.

Together the two systems make up Ideal's intelligent driving stack, which also makes Ideal's solution completely different from those of other automakers.

In the US market, Tesla FSD leads in both computing power and data.

In the Chinese market, however, Ideal's strategy appears to be to retrace "Tesla's intelligent driving road" in a Chinese version and so move itself a little further ahead.

Lang Xianpeng said, "In terms of training computing power and training data in China, we believe that at least for now we are ahead of Tesla, because Tesla still has to build up data compliance, faces certain constraints in China, and has to deploy training computing power in China."

Ideal has also introduced a world model into its testing process.

Ideal says the world model supports large-scale, rapid iteration of its new generation of intelligent driving and provides an automated system for evaluating AI capability. Reconstruction technology turns problem scenarios that users have encountered into a "wrong-question set", and generation technology turns users' real driving scenarios into "mock exam questions". Together, the two ensure during model evaluation that a question once answered wrong is not answered wrong again, while also checking that the model generalizes well.
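One way to picture this two-part evaluation is as a release gate. The Python sketch below is an illustration under assumptions, not Ideal's evaluation system: the scenario format, the function names `pass_rate` and `release_gate`, and the 90% threshold for generated scenarios are all hypothetical.

```python
from typing import Callable, Dict, List

Scenario = Dict[str, float]         # hypothetical scenario description
Model = Callable[[Scenario], bool]  # True if the scenario is handled without a problem


def pass_rate(model: Model, scenarios: List[Scenario]) -> float:
    """Fraction of scenarios the model handles; 1.0 for an empty set."""
    if not scenarios:
        return 1.0
    return sum(1 for s in scenarios if model(s)) / len(scenarios)


def release_gate(model: Model,
                 wrong_question_set: List[Scenario],
                 generated_set: List[Scenario],
                 min_generated_rate: float = 0.9) -> bool:
    """Assumed rule: no repeat failures, plus a minimum rate on new scenarios."""
    # Reconstructed past failures must all pass ("never wrong twice").
    regression_ok = pass_rate(model, wrong_question_set) == 1.0
    # Generated scenarios probe generalization; the threshold is an assumption.
    generalization_ok = pass_rate(model, generated_set) >= min_generated_rate
    return regression_ok and generalization_ok
```

For example, with a toy model that passes whenever a hypothetical `gap_m` field is at least 2.0, a candidate that fails even one reconstructed scenario is rejected regardless of how well it does on generated ones.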

The combination of One Model, VLM, and world model also brings Ideal's new generation of intelligent driving products into a new stage: "supervised autonomous driving".

Ideal is the first company to deploy a VLM on the Orin-X chip and the first automaker to adopt a dual-system architecture. It has already sold nearly one million vehicles in China, which is bound to increase its share of valid data. Ideal Auto's cumulative training mileage has exceeded 2.2 billion kilometers and is expected to exceed 3 billion kilometers by the end of 2024. Its training computing power has reached 5.39 EFLOPS and is expected to exceed 8 EFLOPS by the end of 2024.

However, the industry is still arguing about the application and prospects of end-to-end technology. Some believe that intelligent driving cannot be done well without 50 billion yuan, while others believe that, at least for the next few years, rule-based algorithms and end-to-end functions in individual modules will continue to run side by side, and that pure end-to-end is still empty talk.

To a certain extent, test results from Ideal's user experience group have validated this path. In any case, Ideal Auto has been the first to take this step.

Dr. Lang Xianpeng, vice president of intelligent driving R&D at Ideal Auto, and Zhan Kun, senior algorithm expert in intelligent driving at Ideal Auto

The following is a transcript of an exchange with Dr. Lang Xianpeng, vice president of intelligent driving R&D at Ideal Auto, and Zhan Kun, a senior algorithm expert in intelligent driving at Ideal Auto. The conversation has been edited without changing the original meaning:

Question: Everyone in the market claims to be end-to-end, but what is true end-to-end?

Zhan Kun: End-to-end is a research and development paradigm. As the name suggests, it means completing a task from the initial input to the final output with no other processes in between. Using one model to go from input to output is the essential meaning of end-to-end. Anything that meets this definition can be called end-to-end.

Ideal Auto now runs an integrated One Model end-to-end system. Sensor data goes in directly, and after model inference the planned trajectory comes out directly and is used to control the car. This is an integrated end-to-end system with no other steps in between. There is another approach that splits the model into two, bridged by an intermediate signal: a perception model takes the input, its perception result is fed into a control model, and together they form a modular end-to-end system. This may also be called end-to-end, but we do not consider it true end-to-end. The point of Ideal Auto's end-to-end system is to avoid losing information in the middle. If a human-designed information digestion step is inserted in the middle, efficiency may suffer or the ceiling on capability may be limited, so we believe an integrated end-to-end system is closer to the essence of end-to-end.

Question: Were you inspired by Tesla? How does your approach differ from traditional modular end-to-end?

Zhan Kun: Tesla did mention end-to-end in early 2023, and Musk also said on Twitter that it was a complete model controlling the car directly from input to output. Everyone was shocked by the news, even though the idea itself was not new. In 2016, NVIDIA published a paper on an end-to-end model, but the results were mediocre and it only handled very simple scenarios. With the computing power and model scale of the time, everyone thought this path was not feasible.

By 2023, with the new transformer architecture and the massive computing power Tesla added, a new paradigm became possible. Tesla was not the first to propose end-to-end, but it pushed it in a direction with more room to grow. After seeing this, we reflected internally that, compared with the earlier modular approach, the more fundamental benefit of end-to-end is that it reduces redundancy in the information passed along. Our mapless solution was already close to modular end-to-end: we had one large perception model, which is in effect a modular end-to-end design. Even so, we found that this kind of end-to-end model still needed rules, module-specific data, and module-specific strategy tasks.

In discussing and designing the new solution, we decided that end-to-end had to be more thorough and closer to the essence. Ideal Auto has very rich data, and we believe this data can support us in doing it well, which is our advantage. So we chose the more challenging integrated end-to-end architecture. It has a high upper limit, but the drawback is that training is harder than with separate modules, including the data mix and the training methods. There is a lot of know-how that still has to be explored, but we resolutely chose the difficult but correct path.

Question: Many brands now claim to be leaders, and Ideal Auto also says it has entered the first tier of intelligent driving. How do you rate the end-to-end technology of the companies on the market?

Lang Xianpeng: From a technical point of view, ordinary consumers do not care whether a system uses a map or not, or whether it is end-to-end or not. What everyone ultimately cares about is the product, the experience of using it, and the value it delivers. So we are not trying to compete with anyone; we want to provide better products and services to our users. Previously, highway NOA with high-precision maps met users' needs. Next, we tried many approaches to city NOA. One simple idea was to do city NOA with a map, but we found that no map vendor could provide high-precision maps of cities, only light maps. We do not think light maps are good enough, because whenever the map has to be updated there are problems with timeliness and with whether it is really usable. We cannot let users feel that a place works today but not tomorrow.

In the end, we decided to go mapless. The earlier mapless solution was still a modular one, with separate perception and planning, and it involved a large number of hand-written rules and real-vehicle tests. Budget aside, the time required was prohibitive: if every model iteration has to be tested across all the conditions of a full year, it cannot be done in less than one or two years, and users cannot wait that long. So we moved to the end-to-end + VLM technical architecture. I think this solution is, in essence, an artificial intelligence solution. It is not designed; it grows by itself.

In addition, I introduced the world model to you today. In my view, this capability is the most important and necessary guarantee for iterating autonomous driving quickly. If a model iteration is validated with traditional methods, it requires a large number of vehicles, people, and time for testing. Now we use generation and reconstruction technology to collect the scenarios where problems occurred in the past and build our own library of wrong-question scenarios. Before each release, we run more than 10 million kilometers of wrong-question tests, and this is an effective set of wrong questions, not random road testing. We can also generate and simulate scenarios, which adds tens of millions of scenario tests. Iterating models this way is much more reliable than the old whole-vehicle road testing, and it can cover all kinds of scenarios across the year. This is what we do. We do not know whether our peers do the same, but we start entirely from user needs. We do not iterate technology for its own sake; we do it because the technology really solves user needs and delivers a better product experience.

Question: Not long ago, someone argued that "intelligent driving cannot be done well without 50 billion yuan". What do you think?

Lang Xianpeng: On the 50 billion, we need to be clear whether it is a one-time investment or a long-term one. As mentioned today, we will invest 1 billion US dollars in intelligent driving R&D every year. If that continues for 10 years, the total will exceed 50 billion yuan.
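The arithmetic behind this comparison can be checked in a few lines of Python. The yearly budget, the ten-year horizon, and the 50 billion yuan figure come from the interview; the exchange rate of roughly 7 RMB per US dollar is an assumption added here for illustration.

```python
USD_PER_YEAR = 1_000_000_000    # from the interview: 1 billion US dollars per year
YEARS = 10                      # from the interview
THRESHOLD_RMB = 50_000_000_000  # the "50 billion yuan" figure under discussion
RMB_PER_USD = 7.0               # assumption: approximate exchange rate

total_rmb = USD_PER_YEAR * YEARS * RMB_PER_USD
years_to_threshold = THRESHOLD_RMB / (USD_PER_YEAR * RMB_PER_USD)

print(f"{total_rmb / 1e9:.0f} billion RMB over {YEARS} years")
print(f"50 billion RMB is reached after about {years_to_threshold:.1f} years")
```

Under that assumed rate, ten years of spending comes to about 70 billion RMB, so the 50 billion mark would be passed a little after the seventh year.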

The end-to-end + VLM technical architecture is a watershed. Before it, we were still doing autonomous driving with traditional methods. Starting from this generation, we are truly using artificial intelligence to do autonomous driving. In the future, the core competition in autonomous driving R&D will be over who has more and better data, and the computing power to train models on it. Obtaining computing power and data depends on how much money and how many resources are invested. Some of these things cannot be bought, such as training data and training mileage. Each automaker has its own data, and it is not shared.

The other thing that needs investment is computing power. We now have 5.39 EFLOPS of computing power, and we expect to reach 8 EFLOPS by the end of this year. That is no longer 1 billion RMB but 2 billion RMB of spending, which means 2 billion RMB consumed every year. In the future, when we reach the L4 stage, data and computing power will grow exponentially every year, which means at least 1 billion US dollars (6 to 7 billion RMB) will be needed each year, and after five years it will still need to keep iterating. At that scale, it will be very hard for a company's revenue and profit to sustain the investment. So rather than focusing on how many billions are invested in autonomous driving, we should start from the essentials: see whether there is enough computing power and data, and then work out how much money needs to be invested.

Question: How do you ensure the model is safe when the amount of data is not very large? Conceptually, is what you have now also a kind of "two models"?

Lang Xianpeng: Safety is a major concern. Are there coordination issues? Are there independent safety modules? And so on. People ask these questions because they are still thinking from the perspective of earlier, non-AI autonomous driving development. It is like someone who has always ridden horses asking whether a car has a saddle. People have not yet really understood the difference between the AI approach and the non-AI approach. That is the first point.

Second, many people now claim to have an end-to-end model, but to truly do end-to-end you need two capabilities: enough data and enough computing power. Without them, I think it is hard to do true end-to-end, because end-to-end is the AI approach.

Third, both the upper and lower limits of end-to-end are very high. Let me make an analogy. Before CNNs (convolutional neural networks) came along, people were still using traditional machine learning methods for tasks such as the well-known image classification problem. At the time, SVM algorithms had hit a bottleneck, and CNNs beat them by more than 10% as soon as they appeared. My point is that people have not really grasped what end-to-end is capable of. And we would not push it to our internal test users rashly.

With non-AI methods, we have to consider a great many scenarios in the details of longitudinal control. So when designing scenario rules, we have to set many conditions specifying which strategy to use in which situation. But when we trained the first version of the end-to-end model, I found it was very comfortable at every intersection and in every situation that required longitudinal control. We did no tuning for special cases; the model learned this by itself. It shows that rule-making has a fundamental problem: the scenarios are too diverse, and we cannot write rules for all of them. When we build an end-to-end model with AI, we find that it has this magic. We give it data, and it really does learn the driving experience of those people. It not only learns the upper limit but also greatly raises the lower limit. It still has limitations, but our way of addressing them is no longer to write rules; it is to give it more and better data.

We also have a safety fallback in the control module. The end-to-end model goes from sensor input to trajectory output, and the trajectory is passed to the steering and braking modules. It is at this stage that we apply the safety fallback. For example, if the model tries to make a sudden 180° turn, we restrict it. But there are very few such rules, a negligible number compared with the previous practice.
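A fallback of this kind can be pictured as a small guard between the model's trajectory and the actuators. The Python sketch below is a hypothetical illustration, not Ideal's rule: it represents the trajectory as a list of target headings in degrees, and the function name `limit_heading_change` and the 15° per-step limit are assumptions.

```python
from typing import List


def limit_heading_change(headings_deg: List[float],
                         start_deg: float,
                         max_step_deg: float = 15.0) -> List[float]:
    """Clamp the heading change between consecutive trajectory points.

    Simplification: headings are treated as plain numbers, so wrap-around
    at 360 degrees is not handled.
    """
    limited = []
    prev = start_deg
    for target in headings_deg:
        delta = target - prev
        # Restrict abrupt turns; gentle changes pass through unchanged.
        delta = max(-max_step_deg, min(max_step_deg, delta))
        prev = prev + delta
        limited.append(prev)
    return limited
```

With these assumed numbers, a requested jump from 0° straight to 180° is spread out as 15°, 30°, and so on, while an ordinary trajectory such as 5°, 10° is left untouched, which matches the idea that the rule only matters in rare cases.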

In this way we have raised both the safety floor and the capability ceiling. Our approach is to keep feeding the model high-quality data, and it will certainly learn many safe driving habits.

Question: How do you ensure that the data fed to end-to-end + VLM is clean?

Zhan Kun: Whether for end-to-end or VLM, data is the most important thing, and large models have always emphasized high-quality data. So our first step is to clean the data at the source. We are very strict in selecting driving data. We have an internal scoring system for each car owner that covers multiple weighted dimensions: for example, whether the driver breaks traffic rules, drives on the lane line for long stretches, stops properly at the stop line, jerks the steering wheel, or drives in a way that feels uncomfortable. We combine these indicators into a score and treat the top 3% of users as "experienced drivers". Given Ideal's large data volume, even the top 3% is a very large amount of data, and we can still be sure the data we get is very good. At the very least, the driving behavior is compliant, comfortable, and reasonable. This is good data to feed to the end-to-end model.
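A weighted score followed by a top-3% cut can be sketched as follows. Only the overall idea (weighted dimensions, top 3%) comes from the interview; the metric names, the weights, and the scoring formula are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List

# Hypothetical penalty weights per dimension (higher metric = worse driving).
WEIGHTS: Dict[str, float] = {
    "violations_per_100km": 0.4,
    "lane_line_riding_ratio": 0.2,
    "stop_line_misses_per_100km": 0.2,
    "harsh_steering_per_100km": 0.2,
}


def driver_score(metrics: Dict[str, float]) -> float:
    """Assumed formula: start from 100 and subtract the weighted penalties."""
    penalty = sum(weight * metrics[name] for name, weight in WEIGHTS.items())
    return 100.0 - penalty


def select_top_drivers(all_metrics: Dict[str, Dict[str, float]],
                       top_fraction: float = 0.03) -> List[str]:
    """Return the IDs of the best-scoring fraction of drivers."""
    ranked = sorted(all_metrics,
                    key=lambda driver: driver_score(all_metrics[driver]),
                    reverse=True)
    keep = max(1, round(len(ranked) * top_fraction))  # keep at least one driver
    return ranked[:keep]
```

Only the trips of the selected drivers would then be passed on as training data; everything about how the real system measures each dimension is outside what the interview describes.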

The second layer is also a screening step. During model training, we need to balance and classify the training samples by scenario, since there are many extreme and difficult scenarios. Evaluation models, evaluation methods, and some rules are used to clean the data and give each type of data a very detailed label.

In the last layer, we learn during training which samples are hard for the model and adjust the learning strategy accordingly, including deliberately constructing synthetic data for reinforcement learning and contrastive learning. We adjust the learning method based on our data, so that all of our end-to-end and VLM data is well verified and cleaned and the resulting model is better. This process does not happen overnight.
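The interview does not say how the learning strategy is adjusted for hard samples. One simple, commonly used option is to sample training examples in proportion to their current loss, sketched below in Python. This is a swapped-in illustration, not Ideal's method; the function name `sampling_weights` and the `sharpness` parameter are hypothetical.

```python
from typing import List


def sampling_weights(losses: List[float], sharpness: float = 1.0) -> List[float]:
    """Turn per-sample losses into sampling probabilities that favor hard samples.

    `sharpness` is an assumed knob: values above 1 focus more strongly on
    high-loss samples, while 0 gives uniform sampling over positive losses.
    """
    if not losses:
        return []
    powered = [max(loss, 0.0) ** sharpness for loss in losses]
    total = sum(powered)
    if total == 0.0:
        # Every sample is already learned: fall back to uniform sampling.
        return [1.0 / len(losses)] * len(losses)
    return [value / total for value in powered]
```

With losses of 1.0 and 3.0, for example, the harder sample would be drawn three times as often as the easier one in the next round of training.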

Lang Xianpeng: There is also the question of dirty data. Our training data volume is relatively large, so even if a small amount of dirty data remains, a single stain does not contaminate the whole result in AI training. As long as the amount of accurate data is large enough, a little noisy data will not matter much.

Question: The technology battle in intelligent driving has changed rapidly and gone through several major iterations over the past few years. Will end-to-end + VLM be a framework with long-term vitality?

Lang Xianpeng: End-to-end + VLM is an architecture that imitates human thinking and cognition. We are doing artificial intelligence, and ultimately we hope the system will behave like a human. I was greatly inspired by the book "Thinking, Fast and Slow", which made me want to understand how people perceive and think. We think the current framework is very reasonable, and we are glad to see that since we proposed it, many companies in the industry have also started to talk about the benefits of the dual-system theory and are trying to follow. Moreover, the dual-system theory applies not only to autonomous driving; it is also a paradigm for future artificial intelligence and even intelligent robots. An autonomous car can be seen as a wheeled intelligent robot whose working range happens to be the road. So I think the framework has real long-term vitality, but technological development never stops. We will stay alert to advanced technologies and follow up on any new ones.

Question: How big does Ideal feel the gap is between itself and Tesla's intelligent driving technology right now, and when will it be able to catch up?

Lang Xianpeng: Last year my answer was that the gap was half a year, and this year it may be smaller. First, in terms of technical architecture, we are not much different from Tesla and are even slightly ahead, because we have the VLM and System 2, while Tesla only has the end-to-end System 1. Second, in terms of training computing power and training data in China, we believe that at least for now we are ahead of Tesla, because Tesla still has to build up data compliance, faces certain constraints in China, and has to deploy training computing power in China. From this perspective, the gap between us and Tesla in China may not be that big. We also very much hope that Tesla can enter the market, so that we can learn from each other and keep improving ourselves.

Question: Some argue that the AI path for intelligent driving is not quite right and will not work, because L2 is more about low cost and broad availability, whereas L4 can only become widespread once the safety problem is solved. So can mass-produced cars reach L4?

Lang Xianpeng: First, we believe everything should start from user needs and user value. Any product Ideal Auto makes must meet or exceed what users value, and we only make a product when users find it valuable. We believe users do want autonomous driving, so we cannot design a product that lets users drive at L4 only in Chengdu and nowhere else.

Second, each brand can weigh and choose its own technology route, whether gradual or leapfrog. Ideal Auto, though, will always choose the route that meets user needs. We have chosen to do autonomous driving with artificial intelligence. Previously, assisted driving meant a system assisting a person who was driving, and the person was the main actor. Now, at the end-to-end + VLM stage, we consider the car to be driving itself. Once the complete model has been trained, it is capable of driving the car well. I supervise the car, watching for anything going wrong or for a prompt to take over, but the main actor is the car, and the person plays a supporting, supervisory role. If this level is reached, our users' need for autonomous driving will be met. That is our logic.

Question: Does Ideal Auto plan to charge for advanced intelligent driving?

Lang Xianpeng: Standard equipment and free of charge: that has been Ideal's strategy since the first day it entered intelligent driving. "Supervised autonomous driving" is free for all AD Max owners. Deliveries are relatively strong and the company's operations are stable, so there are sufficient resources to invest in intelligent driving R&D. Delivery volume is a very important metric. For us it is not only about deliveries; it also means more vehicles contributing training mileage for autonomous driving.
