
can end-to-end bring a new spring? a deep dive into the divided autonomous driving industry

2024-09-25


can driverless cars really be realized?

humans have spent enormous amounts of time and money developing driverless cars. today, frequent accidents, endless cash burn, and slow progress have raised confusion and doubt: is driverless driving a scam, or is the industry dead?

this is genuinely one of the most divided industries i have ever seen. different factions hold different opinions, look down on each other, and blame each other. after the arguments, each faction goes its own way, falls into its own pitfalls, burns its own money, and some go bankrupt.

the result was that, heading into 2024, the driverless industry had entered a cold winter.

but in this cold winter, as musk claimed to have rebuilt tesla's fsd "through end-to-end ai technology" and announced plans to enter the driverless taxi (robotaxi) business, there seems to be new vitality and hope.

can end-to-end lead us to true autonomous driving? are the l2 and l4 defined in autonomous driving really that far apart? how far has autonomous driving technology developed today? is there really no end to the battle between pure vision and multimodality?

to explore the state of the driverless industry, we spent three months interviewing more than a dozen professionals from the most cutting-edge autonomous driving companies in the global market, including former core employees of waymo and cruise, former tesla fsd engineers, and primary and secondary market investors.

we found that the industry is still fragmented and there is no consensus within the industry on many technical routes.

in this series of articles, we will discuss the cutting-edge status of today's autonomous driving technology from multiple angles, including perception, algorithms, products, operations, economics, and law.

in this article, we will first talk about technology in general, and in the next article we will analyze it from the operational and economic perspectives.

1. what is autonomous driving?

let’s first make a conceptual distinction: what is the difference between unmanned driving and autonomous driving?

according to the different levels of intelligence, autonomous driving is divided into 6 levels from l0 to l5:

l0 refers to no automation, l1 refers to driving assistance, l2 refers to partial autonomous driving, l3 refers to conditional autonomous driving, l4 refers to highly autonomous driving, and l5 refers to fully autonomous driving, which is truly unmanned driving.

waymo and cruise, which we will discuss later, as well as the driverless trucks made by hou xiaodi, are all at the l4 level. tesla fsd is at the l2 level, but musk's planned tesla robotaxi would be at the l4 level.

therefore, currently in this industry, when people talk about driverless cars, they are generally referring to l4 companies, because no one can achieve l5 yet; while generally speaking, autonomous driving includes all levels and is a broader term.

let’s take a look at how the autonomous driving industry started.

although humans began exploring driverless cars 100 years ago, it is generally recognized that modern autonomous driving officially originated from the us military's darpa challenge in 2004.

after several years of development, an operating chain of perception-planning-control took shape, where the perception module also covers prediction.

the perception layer needs to obtain the road conditions ahead through sensors such as radars and cameras, predict the movement trajectory of objects, and generate a map of the surrounding environment in real time, which is the bird's-eye view we often see on car computers. this information is then passed to the planning layer, and the system determines the speed and direction based on the algorithm. finally, it is transferred to the execution control layer to control the corresponding throttle, brake and steering gear.
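to make this chain concrete, here is a minimal, purely illustrative python sketch of how information flows from perception through prediction and planning down to control. the interfaces, thresholds, and numbers are all invented for illustration, not taken from any company's actual stack.

```python
from dataclasses import dataclass

@dataclass
class Obstacle:
    distance_m: float        # distance ahead of the ego vehicle, in metres
    speed_mps: float         # obstacle speed along the lane, metres per second
    predicted_gap_m: float   # predicted gap at the end of the planning horizon

def perceive(sensor_readings: list[float]) -> list[Obstacle]:
    # perception: turn raw readings into obstacles on a bird's-eye view
    return [Obstacle(d, 0.0, d) for d in sensor_readings if d < 100.0]

def predict(obstacles: list[Obstacle], horizon_s: float = 2.0) -> list[Obstacle]:
    # prediction: extrapolate each obstacle's motion over the horizon
    return [Obstacle(o.distance_m, o.speed_mps,
                     o.distance_m + o.speed_mps * horizon_s) for o in obstacles]

def plan(obstacles: list[Obstacle], cruise_mps: float = 15.0) -> float:
    # planning: slow down if any predicted gap falls below a safety margin
    if any(o.predicted_gap_m < 30.0 for o in obstacles):
        return 5.0
    return cruise_mps

def control(target_mps: float, current_mps: float) -> dict:
    # control: map the speed target onto throttle / brake commands
    error = target_mps - current_mps
    return {"throttle": max(0.0, error) * 0.1, "brake": max(0.0, -error) * 0.1}

# each stage only sees the previous stage's output, so information lost
# upstream can never be recovered downstream
print(control(plan(predict(perceive([25.0, 80.0]))), current_mps=15.0))
```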

later, with the rise of ai, people began to let machines learn how to drive by themselves. they first let the algorithm drive in a simulated digital world, and when the simulation training reaches a certain level, they can start road testing.

in the past two years, as tesla applied the "end-to-end" solution to the fsd v12 version, the operating chain of perception-planning-control has also begun to change.

next, we will focus on the two technical camps in autonomous driving perception: the pure vision faction and the multimodal fusion faction. these two factions have been fighting for many years, each claiming its own merits. let's look at the history of their feud.

2. perception: pure vision vs. multimodal fusion

currently, there are two mainstream perception solutions for automobiles.

the first is the multimodal fusion perception solution adopted by many companies, which aggregates and integrates information collected by sensors such as lidar, millimeter-wave radar, ultrasonic sensors, cameras, inertial measurement units, etc. to judge the surrounding environment.
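to make the idea of "fusion" concrete, here is a toy illustration of one simple late-fusion step: combining independent distance estimates from lidar and a camera, weighting each by how much that sensor is trusted in the current conditions. the numbers and trust weights are invented and this is not how any of these companies actually fuse their sensors.

```python
# a toy late-fusion step with invented numbers and weights
def fuse_distance(lidar_m: float, camera_m: float,
                  lidar_trust: float = 0.8, camera_trust: float = 0.2) -> float:
    total = lidar_trust + camera_trust
    return (lidar_m * lidar_trust + camera_m * camera_trust) / total

# lidar measures geometry directly; the camera estimate comes from a
# 2d-to-3d network and carries more uncertainty, so it gets less weight here
print(fuse_distance(lidar_m=42.3, camera_m=45.1))   # ~42.9 m
```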

let’s go back to the darpa challenge we talked about in the previous chapter. in the first competition in 2004, although no car finished the race, a contestant named david hall realized the importance of lidar during the competition. after the competition, velodyne, which he founded, began to shift from making audio equipment to making lidar.

at that time, lidar was still single-line scanning and could only measure distance in one direction, but david hall invented a 64-line mechanical rotating lidar that could scan the environment 360 degrees.

later, he took this rotating lidar to the second darpa challenge in 2005. that year, a car with five lidars on its roof finished the race and won the championship.

but it wasn't david hall's car... his own car retired midway due to mechanical failure. still, his performance made everyone realize that lidar is like a cheat code.

in the third darpa challenge in 2007, 5 of the 6 teams that finished the competition used velodyne's lidar. since then, lidar has become a hot commodity in the field of autonomous driving, and velodyne has become a leading company in automotive lidar.

zhang hang (senior chief scientist, cruise):

now, whether it is cruise or waymo, the l4 solutions they are working on are mainly based on lidar, which can directly obtain position information. this way, the requirements on the algorithm itself are relatively lower, and a lot of 3d information can be obtained directly from the sensors, which makes things easier in terms of system robustness, safety, and some long-tail problems.

another technical faction is the pure vision solution represented by tesla, which relies solely on cameras to collect environmental information, and then uses neural networks to convert 2d videos into 3d maps, which include information such as obstacles in the surrounding environment, predicted trajectories, speed, etc.

compared with the lidar solution that directly generates 3d maps, pure vision has an additional process of 2d to 3d conversion. in zhang hang's opinion, relying solely on "video" training data that lacks 3d information will bring certain challenges to safety.

zhang hang (senior chief scientist, cruise):

it needs a lot of training data to make up for the missing 3d information. that means there is a lack of supervision: without a reference, it is hard to get ground truth in reality. if you want to achieve system safety through this kind of semi-supervised learning, i think it is difficult. i think tesla's main purpose is to control costs, including changes like simplifying the shifting mechanism, all to save on parts costs.

but in the view of yu zhenhua, a former ai engineer at tesla, choosing pure vision is not as simple as just saving costs.

1. more means more chaos?

yu zhenhua (former tesla ai engineer):

in fact, tesla's original autonomous driving system was equipped with millimeter-wave radar. sensor fusion is actually a very complex algorithm, and the result is not necessarily better even once it is built.

i had a car at the time, one of the last with millimeter-wave radar. in 2023, my car was serviced and the service engineer removed my radar as a matter of course. what is the conclusion here? removing the millimeter-wave radar is not about cost, because my car already had one. the root cause is that pure vision had surpassed the millimeter-wave radar. so tesla is doing subtraction, removing redundant things that it thinks are unnecessary or cumbersome.

yu zhenhua believes that if the fusion algorithm is not done well, or if good enough results can be achieved through pure vision alone, then more sensors become a burden.

many l4 practitioners we interviewed also agreed that more information is not always better. on the contrary, too much irrelevant information collected by extra sensors increases the burden on the algorithm.

so, will it work to rely solely on cameras as a sensor, as musk has always advocated?

2. less is more?

musk has said that since humans can drive using only two eyes, cars should also be able to achieve autonomous driving based on image information alone. however, the industry has long been concerned about the pure vision camp's susceptibility to visual deception, which has indeed caused many accidents in the past.

for example, tesla identified a white truck as the sky and the moon as a yellow light, or ideal identified the content on a billboard as a car, leading to accidents such as sudden braking and rear-end collisions at high speeds.

do these cases mean that pure visual solutions without depth information have inherent deficiencies?

yu zhenhua (former tesla ai engineer):

multiple information streams can indeed provide more information, but you have to answer a question: is the information from the camera itself not enough? or is the algorithm’s ability to mine information insufficient?

for example, when the car brakes hard or jerks on city roads, the root cause is usually an inaccurate estimate of the speed and heading of surrounding objects. if that is the reason, then lidar is indeed much better than the camera, because it provides that information directly. the camera actually gives you the information too, but our algorithms are not good enough to dig it out.

yu zhenhua does not think the root cause of visual deception is insufficient information from the camera, but rather that the algorithms are not yet good enough at processing and mining the information the camera provides. in his view, the launch of tesla's fsd v12 further proves that once the algorithm is greatly optimized, the mining and processing of camera information improves significantly.

yu zhenhua (former tesla ai engineer):

today's fsd v12 is not perfect and has many problems, but i have not found any problem that is due to insufficient sensors. of course, many problems before v12 were due to insufficient sensors, but today's v12 does not have this problem.

however, l4 practitioners have a different view. they believe that cameras have a natural disadvantage.

zhang hang (senior chief scientist, cruise):

i personally think it is difficult, and i don’t think it is necessarily a problem with the algorithm itself.

first of all, the camera itself is not as sophisticated as the human eye. each camera has fixed parameters and its own limitations.

then as for the algorithm itself, people don't need to know the movements of all the cars within a 200-meter range. i only need to know which cars and pedestrians may affect the behavior of my car. it is enough for me to focus on these points. i don't need a lot of computing power. it may not be possible to reach this level through algorithms in the short term. i think lidar is just a supplement.

zhang hang, who works on l4, believes that cameras cannot compete with human eyes, mainly because a camera's focal length and resolution are fixed, while human eyes are highly precise and can refocus automatically. at the same time, humans' leaping, associative way of thinking cannot be transferred to computers in the short term, so lidar can make up for the cameras' shortcomings.

but there are other opinions on the market that in addition to visual information, other sensors can also bring interference information.

for example, lidar also has its own flaws. since it uses lasers to measure distance, it can be interfered with by reflective objects, rain and snow, or lasers emitted by other vehicles, which can ultimately produce phantom detections.

liu bingyan (person in charge of kargo software):

i am a firm believer in pure vision. the world is designed for people, for vision. in other words, anything you collect beyond vision can be regarded as interference. of course you can collect it, but what is the split between the interference that information brings in and the real value it provides? i think as vision gets better and better, the balance may tip completely the other way.

if a multi-sensor fusion algorithm can be developed that lets lidar and image information verify each other, the safety of the system may be further improved.

hou xiaodi offered a vivid metaphor: when two students of the same level take an exam, the one who uses a calculator will definitely have an easier time. it's just that their financial situation determines whether they can afford a calculator.

the debate over pure vision versus lidar-based multimodal fusion has been going on for years, and it seems there will be no answer in the short term. for some startups, the route is not that important at all; cost and the economics are what matter most.

hou xiaodi (former founder and ceo of tusimple, founder of bot.auto):

i was once considered a visual person because we couldn’t buy lidar at that time, so we were forced to look for more visual solutions.

i am not against lidar either. when lidar becomes cheaper, i will be the first one in line. lidar is really cheap now, so i am also queuing up to buy it. for me, a cat that catches mice is a good cat. as long as the cost of the device is low enough and it provides us with valuable information in the information-theoretic sense, we should use it.

david (host of "big and small ma chatting about technology"):

china's autonomous driving industry has quickly made this hardware, such as lidar and millimeter-wave radar, very cheap. in this situation, should we still use pure vision like tesla? in fact, many companies are now hesitating: should i buy a solid-state lidar for a bit over 1,000 yuan, or should i use pure vision, which burns a great deal of computing power?

yu zhenhua (former tesla ai engineer):

i think 1,000 yuan is too expensive. tesla doesn’t even want to use a rain sensor.

wang chensheng (former tesla purchasing director):

but i think that as the scale of the supply chain increases and costs drop significantly, when lidar can be priced similar to cameras, especially in an end-to-end application scenario, is pure vision still the only path?

3. a change of heart?

interestingly, as the price of lidar has dropped significantly, the industry has begun to have different opinions on whether tesla's upcoming driverless taxis will use lidar.

for example, zhang hang believes that since robotaxi does not involve human intervention and the company is responsible if something goes wrong, tesla may choose a more conservative route and use the lidar that it once looked down upon.

zhang hang (senior chief scientist, cruise):

especially when the company has to take responsibility for accidents, it needs to be more conservative, and i think it may need an additional sensor. from this perspective, tesla may adopt some technologies it previously despised. as long as something is useful and helps achieve its l4 purpose, it will gradually be adopted.

recently, we have also discovered that tesla is also considering some aspects of l4 and l5, and is also discussing cooperation with some lidar manufacturers, so it may be that everyone is reaching the same goal through different paths.

this year, lidar manufacturer luminar released its first-quarter financial report, showing that tesla accounted for 10% of its revenue, making tesla its largest customer. however, yu zhenhua disagreed, thinking this was nothing new.

yu zhenhua (former tesla ai engineer):

first of all, it is definitely not for using lidar in future mass-produced cars, because luminar's total revenue in the first quarter was around 20 million us dollars, and 10% of that is 2 million, which doesn't buy many lidars. in fact, it is no secret that tesla's engineering and test vehicles are equipped with lidar. the lidar is used to collect ground truth (true value data) for training neural networks, because it is impossible to manually label how many meters away an object is from you; it has to be labeled with a dedicated sensor.

but i am actually quite confused as to why luminar disclosed this in the first quarter, because musk also responded at the time, saying that after v12 we no longer need true value data because it is end-to-end, and the occupancy network is a thing of the v11 era. i think there may be some misunderstanding here, perhaps something to do with how financial reports or accounting rules work.

although it is not certain whether tesla's upcoming robotaxi will be equipped with lidar, one thing is certain: with tesla's current perception configuration, the safety is not yet sufficient to reach l4 or to operate robotaxi.

liu bingyan (person in charge of kargo software):

i am very sure that the existing tesla models all have very clear blind spots, that is, areas the cameras simply cannot see. if the next car wants to go all the way, whether to l4 or l5 autonomous driving, it will have to solve this blind spot problem.

we will dissect tesla’s latest end-to-end technology updates and the robotaxi details that will be announced in october in detail in chapters 3 and 4. next, let’s explore another important perception technology: high-precision maps.

4. high-precision maps: outdated or essential?

in addition to lidar, high-precision maps are also a major cost component of autonomous driving perception.

high-precision maps collect road information in advance, reduce the pressure on the perception module to draw 3d maps, and improve accuracy.

coincidentally, the first person to promote high-precision maps was the champion of the second darpa challenge in 2005 - the man behind the car with five lidars on its roof, sebastian thrun.

during the 2004 darpa challenge, google was preparing for the "street view" project. google founder larry page personally went to the competition site to scout for talent. after the competition ended in 2005, page approached sebastian thrun, invited him to join google, and gave him the task of drawing maps.

during this process, thrun and page realized that if there were a map that accurately recorded all lane lines, road signs, traffic lights and other road information, it would be of great help to autonomous driving. this also established the important position of high-precision maps in driverless projects.

however, producing high-precision maps is very expensive. the average cost for autonomous driving companies to collect high-precision maps is about us$5,000 per kilometer. to cover the 6.6 million kilometers of roads in the united states, the collection cost alone would reach roughly us$33 billion.
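the arithmetic behind that estimate, taking the per-kilometer cost and road mileage above at face value:

$$6.6 \times 10^{6}\ \text{km} \times \$5{,}000/\text{km} = \$3.3 \times 10^{10} \approx \$33\ \text{billion}$$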

coupled with the frequent maintenance costs of the map, the final consumption will be an unimaginable astronomical figure.

now many car companies have promoted the idea of abandoning high-precision maps and instead having vehicles build environmental maps locally.

an autonomous driving engineer we interviewed anonymously said that these choices are driven more by business models. for companies doing robotaxi business, using high-precision maps increases safety. for car companies, abandoning high-precision maps effectively reduces costs. so abandoning high-precision maps does not mean a higher level of technology.

anonymous interviewee (l4 engineer):

huawei and ideal are in a similar position. their solutions target mass-produced cars: customers may come from all sorts of cities, and the car has to be drivable in any of them.

the main hurdle for the current mainstream high-precision map is that it requires a map collection process, which is actually relatively time-consuming and labor-intensive, and it also requires professional map collection equipment.

so if you are doing the business of mass-producing cars, you can't say that i have a special map collection car and i will travel all over china for you. this is unrealistic.

l2 companies like tesla, huawei, and ideal have abandoned high-precision maps because they are unable to cover every street and alley.

waymo and cruise, the l4 robotaxi companies, choose to continue using high-precision maps because they have found that just by covering a few key cities, they can capture enough market share.

therefore, for robotaxi companies, whether to use high-precision maps has become an economic question, not a technical one.

minfa wang (former senior machine learning engineer at waymo):

if you only look at the business model of robotaxi and divide the demand for robotaxi in the united states, you will find that the top five cities already account for half of the business volume in the united states. you don't need to let it run everywhere in the united states. in fact, you already have a fairly large market.

similarly, another guest we interviewed who is engaged in l4 autonomous driving trucks also shared that if they want to expand their operating routes, that is, expand the coverage of high-precision maps, they must first measure whether this route is profitable, otherwise they will only lose money to gain publicity.

after all this discussion, there is no unified view in the industry on the perception end. just as hou xiaodi said, a cat is a good cat if it catches a mouse.

next, let's focus on recent progress at the algorithm level, which has attracted much attention, especially the "end-to-end" technology that tesla has been promoting. what exactly is it? will it really change the direction of the autonomous driving industry?

3. algorithm: is end-to-end the future of autonomous driving?

1. what is tradition?

the traditional operation chain of autonomous driving is perception, prediction, planning, and finally control.

the perception module must first identify the road through sensors such as cameras and radars, translate this information into a language that the machine can understand, and pass it to the prediction module.

the prediction module estimates the driving trajectories of other vehicles and pedestrians, then passes this information to the planning module to find the path with the lowest risk, and finally the control signal is passed to the actuation system.

at this time, the algorithm is mainly driven by the "rule base". engineers need to constantly write various rules, such as slowing down when encountering pedestrians and stopping when encountering red lights. in order to take various situations into consideration, the rule base must cover various possibilities as much as possible. correspondingly, the code is also very long.
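as a caricature of what such a rule base looks like, here is a tiny python sketch. the rules below are invented for illustration and are not taken from any production system; the point is that every new scenario means another hand-written branch, which is why these rule bases grow enormous.

```python
def decide(scene: dict) -> str:
    # hand-written rules: one branch per scenario the engineers thought of
    if scene.get("traffic_light") == "red":
        return "stop"
    if scene.get("pedestrian_ahead"):
        return "slow_down"
    if scene.get("emergency_vehicle_behind"):
        return "pull_over"
    if scene.get("plastic_bag_ahead"):
        return "continue"   # harmless obstacle, keep driving
    # ...and thousands more branches for every long-tail case...
    return "continue"

print(decide({"traffic_light": "red"}))       # stop
print(decide({"pedestrian_ahead": True}))     # slow_down
```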

what are the difficulties of such an algorithm?

the biggest problem is that the system is divided into different modules, but there will be some loss in information transmission between modules. if the downstream cannot obtain comprehensive information, the difficulty of prediction and planning will increase.

let's take a simple example. everyone has heard of the game of passing a message among multiple people, right? 10 people pass a sentence from beginning to end, but often when the message is passed through multiple people, the details will be lost or tampered with, so that the meaning is completely different when it reaches the last person.

similarly, in the traditional rule-based model, if the upper layer module is not performed well enough, it will affect the performance of the lower layer.

another disadvantage is that the rules are all manually designed and defined, but limited rules cannot cover the infinite possible real-life situations. it is difficult for machines to come up with corresponding solutions to some uncommon and easily overlooked problems. this is called a "long tail case" or "corner case", which will result in very high costs for large-scale implementation.

yu zhenhua (former tesla ai engineer):

another thing is that when the system is divided into separate modules, i think this technology is difficult to scale. why? every time you add a new task in a complex real-life scenario, you have to add new interfaces, change the perception module, and change the planning and control.

take tesla for example. a few years ago, nhtsa (the national highway traffic safety administration) required tesla to be able to detect emergency vehicles, such as fire trucks and ambulances. on the perception side you are required to detect this, and then planning and control also need to handle it. this is just one task, and there may be hundreds or thousands of such tasks. you have to scale. do you know how many engineers huawei has on this? about 6,000, because so many new tasks keep emerging. the more complex the environment, the more tasks there are. i don't think this is a scalable model.

david (host of "big and small ma chatting about technology"):

this method is still relatively old-fashioned. although it seems to be a relatively flexible methodology for the robotaxi industry, it cannot meet the needs of passenger cars and hundreds of millions of cars that will be driving on roads around the world in the future.

so what are the solutions to these problems? this is where we need to talk about “end to end”.

2. new superstar

in the field of autonomous driving, the current mainstream definition of "end-to-end" is: the information collected by the sensors is passed, without intermediate hand-crafted processing, to a large neural-network-based model, which directly outputs control results.

in other words, there is no longer a need for humans to write various rules, allowing the algorithm to learn how to drive by itself based on the data fed into it.
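as a purely illustrative sketch (made-up layer sizes, not tesla's actual architecture), an end-to-end policy in pytorch is literally just one network from camera pixels to control outputs, trained by imitating human drivers, with nothing in between that is a hand-written rule:

```python
import torch
import torch.nn as nn

class TinyDrivingPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # "perception" is learned implicitly
            nn.Conv2d(3, 16, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 3)               # steering, throttle, brake

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(frames))

policy = TinyDrivingPolicy()
frames = torch.randn(1, 3, 256, 256)               # one rgb camera frame
controls = policy(frames)                          # trained by imitating human drivers
print(controls.shape)                              # torch.Size([1, 3])
```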

yu zhenhua (former tesla ai engineer):

when we humans drive, we don't judge the speed and angle of a car in our minds. we make decisions subconsciously in a complex environment.

"make the algorithm more human-like, because that's how humans work." this kind of thinking logic is exactly the direction that musk is taking in leading tesla. it is no wonder that "end-to-end" technology is not new in autonomous driving, but tesla was the first to develop it.

although tesla first launched the "end-to-end" fsd v12 at the end of 2023, "end-to-end" is nothing new in the field of autonomous driving. in fact, as early as 2016, nvidia proposed "end-to-end" driving in a paper.

now, "end-to-end" is divided into two types. one is to replace some modules with neural networks. this modular "end-to-end" is only a transitional form, not a complete one, because information needs to be transmitted between modules, and various interfaces still need to be defined, resulting in data loss.

in the mainstream view, only when multiple modules are integrated into a whole and definitions such as perception layer, prediction layer, and planning layer are removed, can it be considered pure "end-to-end".

in 2023, the cvpr best paper, "planning-oriented autonomous driving", pointed out that past "end-to-end" work either ran only on some modules or required extra components to be inserted into the system.

this paper proposes the uniad model architecture, which is the first to integrate all perception, prediction, and planning modules into a transformer-based end-to-end network framework.

compared with the traditional rule-based execution chain, "end-to-end" no longer requires algorithm engineers to repeatedly improve the rule base. that is why when musk released fsd v12, he claimed that "its code has been reduced from 300,000 lines to 2,000 lines."

although tesla did not invent "end-to-end" technology in autonomous driving, it is indeed the first company to productize a neural-network "end-to-end" system and bring it to the mainstream market.

3. “end-to-end” advantage

in november 2023, tesla released the first test version of fsd v12, but it was only open to selected employees. in early 2024, tesla began to open the fsd v12 version to all tesla owners in the united states, and each owner had a one-month free trial.

after the launch of fsd v12, it caused a sensation for a while. from the perspective of user experience, we can see that most public opinions believe that it is a huge improvement over the previous tesla fsd function. many people even think that this is the "chatgpt moment" in the field of autonomous driving.

david (host of "big and small ma chatting about technology"):

what really makes me feel that there is progress is planning. for example, passing a roundabout is actually quite difficult to do in the traditional planning direction, because the car in front of you wants to squeeze in and you still have to exit the roundabout. how do you set the priority in the process?

even if you set the priority, you still have to keep a certain distance from the car in front and the car next to you before you can get out. this is actually a very complicated logic, but the performance of this on the new version of fsd really amazed me. it was a big surprise for me.

many people who have experienced fsd v12 said that this system, which learns through human driving data, has a very human-like driving style and no longer has the frustration brought by mechanical algorithms.

but at the same time, some guests also felt after the experience that fsd v12 is not yet good enough to be a must-have, and there is still a certain gap between it and l4.

justin mok (chief investment officer of a family office):

but it has not had its gpt-4 moment. it is not yet good enough that i have to use it, or would use it immediately, or that it fits most scenarios.

minfa wang (former senior machine learning engineer at waymo):

its performance on the highway is relatively good, but on the street, i feel that i need to take over manually basically every 5 miles or so.

especially in what we call unprotected left turns, it is relatively easy for it to do things that make me feel unsafe. if your mpi (miles per intervention) is only 5, then it is obviously still some distance away from l4 autonomous driving.

i have also experienced the fsd 12.4.4 version myself. compared with l4 vehicles such as waymo, the current tesla fsd still scares me at times, or sometimes exhibits inexplicable behavior.

for example, when making a right turn, its turning radius was too large and it almost hit an oncoming car, so i had to take over manually.

from the performance point of view, the "end-to-end" fsd v12 still has room for improvement. from the perspective of engineering, operations and management, the "end-to-end" advantages are as follows:

first, it can make the overall system simpler. after removing the rule base, you only need to continuously add training cases to further improve model performance, and maintenance and upgrade costs will also be greatly reduced.

second, it saves labor costs. since "end-to-end" no longer relies on a complex rule base, there is no need for a large development team, or even for experts.

third, it can be rolled out more widely. as you can see, l4 companies can currently only operate in limited areas. putting aside regulations and licensing, this is because their solutions are not "end-to-end" and need to be optimized for specific regions. "end-to-end" can handle all road conditions and is more like a "universal" driver. this is one of the reasons tesla fsd v12 is compared to chatgpt.

since “end-to-end” has so many advantages, can it solve the technical problems currently facing autonomous driving?

4. black box model

many of the guests we interviewed believed that at this stage, further developing the end-to-end route is a recognized trend in the field of autonomous driving, but there are still many problems.

zhang hang (senior chief scientist, cruise):

i think this is the right direction. we cannot create a large-scale l4 solution by patching all the time. however, i think it is impossible to quickly achieve an l4 solution through an end-to-end solution. so now is a contradictory time point.

why is there still a certain gap between the current end-to-end and l4? this has to start with its uncertainty.

end-to-end is like a black box, which brings more uncertainty.

for example, engineers cannot verify whether the input data cases have been learned by the model; or when encountering a bug, they cannot locate which link has a problem; or whether the newly added data will cause the learned knowledge to be forgotten or overwritten. this situation is called catastrophic forgetting.

for example, the tesla fsd 12.4.2 version had already been developed internally, but it took a long time before it was pushed out at scale. musk explained that this was because the training data contained a lot of videos involving manual takeovers, which actually caused the model's performance to regress.

since the essence of end-to-end is imitation, if the situation encountered happens to have similar cases in the training data, it will perform very well, but if it exceeds the existing reference cases, it will perform worse. in other words, end-to-end has very high requirements on the amount of training data and the richness of cases.

zhang hang (senior chief scientist, cruise):

that is, when the traffic light is red at an intersection, you must not run a red light. it is such a simple rule. if it is a heuristic-based algorithm, we can simply use an if else statement to achieve this effect.

however, if it is a completely end-to-end model, it is completely dependent on learning, and it is actually very difficult for it to learn such a path in the end. so i think there is still a big gap between end-to-end and l4 in a short period of time, and i think this algorithm is immature.

liu bingyan (person in charge of kargo software):

you no longer have hard and fast rules; all the things you once explicitly forbade, it can try to do. so there will be a lot of head-on collisions in the simulation.

at the same time, the unexplainability brought about by end-to-end processing is also a concern for some people.

the so-called unexplainability means that changing any weight, node or number of layers in the algorithm model will have an unpredictable impact on the performance of the model. even the designer and trainer of the model cannot know the intermediate reasoning process.

in contrast, the rule-based mode is explainable. for example, engineers have written a rule that says "you can continue driving when a plastic bag is detected blowing past", so we don't have to worry about the car suddenly braking in such a situation.

liu bingyan (person in charge of kargo software):

as you can see, the display on the screen in v12 is much better, but where does the so-called end-to-end display come from? if the display comes from the original model, then one of the problems involved is that we have actually added a layer of artificially defined interface to the model, so that you can extract the information from a certain position in the model.

another thing that i think is even more frightening is if the display takes a completely different path. that would mean the fact that the car shows a truck ahead does not mean the control model really thinks there is a truck ahead. if that link is broken, it is very, very frightening: you see a car in front of it, but you cannot be sure it won't hit it.

i am actually a little skeptical as to whether it is truly end-to-end, or maybe i am not skeptical, but there may be other dangers here.

wang chensheng (former tesla purchasing director):

so for an industry like autonomous driving, which has such high safety requirements, is the unexplainability brought about by the end-to-end model the other side of the coin?

since tesla has not yet disclosed fsd v12's technical details, we do not know whether fsd adopts a multi-module strategy, but we found that there have been cases where the screen display did not match the car's actual behavior.

for example, the bird's-eye view constructed by the vehicle showed that there was someone in front, but there was no sign of braking, and the vehicle continued to drive past. fortunately, it was just a false detection on the perception end, and no accident occurred.

although this case shows that, under the end-to-end algorithm, an upstream error does not necessarily propagate into the downstream decision, it also shows that the planning layer occasionally disagrees with the perception layer's results, confirming liu bingyan's concerns.

will unexplainability become a major problem that hinders the development of end-to-end? next is the third conflict we see.

yu zhenhua (former tesla ai engineer):

i think so. a very serious problem with ai is that its theory lags far behind its practice.

ai doesn’t tell you that this will definitely work or won’t work. so it is an experimental subject, not science, and requires a lot of verification.

v12 completely crushes v11, so this is a question of results. then you would think that the end-to-end unexplainable thing is a no-brainer because it completely crushes v11, so you should just move on.

yu zhenhua believes that as an experimental subject, ai can prove that the direction is correct as long as the results meet expectations, and should continue to advance. hou xiaodi said that the performance of v12 is far ahead of v11, but because the foundation of v11 is too poor, its performance is still far from true autonomous driving.

wang chensheng (former tesla purchasing director):

if it is really full self-driving, aiming at l5, it must get past the regulators, and regulators need explainability, or at least predictability.

in addition, there are so many cities in the world. in the united states, each city may have different laws and regulations. whether the car needs to adapt to local laws and regulations in terms of hardware and software has become a big problem for scalability.

end-to-end cannot fine-tune the model by manually defining rules, so the ability to adapt to different regulations has become a challenge for end-to-end scaling.

another factor that affects scalability is that end-to-end is more sensitive to data volume and sensors.

5. uncertain future

liu bingyan (person in charge of kargo software):

there is a very severe problem with end-to-end: it is more sensitive to the sensors. that is to say, when you change the sensors or their layout, your model essentially has to be retrained from scratch.

from an engineering perspective this is unacceptable; we cannot imagine the exact same car running on roads all over the world in the future.

once the sensor distribution is changed, the model will become invalid and training will have to be restarted. a large amount of data must be collected for training, which will inevitably incur huge costs.

according to cnbc, a us financial media outlet, by early 2023 tesla had used more than 10 million driving videos from tesla owners to train fsd.

moreover, these more than 10 million pieces of training data cannot be used casually. they must be from human drivers with relatively high driving skills, otherwise the level of the model will only get worse and worse.

therefore, training an end-to-end model requires not only a large amount of data, but also complex screening, which consumes a lot of manpower. this may not be a problem for tesla, which sells a lot of cars, but for other companies, the source of data has become a big problem.

david (host of "big and small ma chatting about technology"):

many oems have been fooled because they blindly pursued tesla's methodology. this methodology is indeed not suitable for 90% of oems.

does that mean that other manufacturers really cannot enter the end-to-end field?

although both nvidia and tesla use pure vision to drive the operation of end-to-end algorithms, the end-to-end can actually accept multimodal inputs.

the positions of commonly used sensors such as millimeter-wave radar, lidar, and ultrasonic radar are relatively fixed on the vehicle; lidar in particular is almost always on the roof. therefore, with end-to-end multimodal access, data collected from different vehicle models can be used to train the model, leaving oems more design space.
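a minimal sketch of what such multimodal access could look like, with invented feature sizes rather than any oem's real design: each sensor stream gets its own encoder, and only the fused features feed the shared driving head.

```python
import torch
import torch.nn as nn

class MultimodalPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        # one encoder per sensor stream; hypothetical feature dimensions
        self.camera_encoder = nn.Sequential(nn.Linear(512, 128), nn.ReLU())
        self.lidar_encoder = nn.Sequential(nn.Linear(256, 128), nn.ReLU())
        self.head = nn.Linear(256, 3)               # steering, throttle, brake

    def forward(self, cam_feat: torch.Tensor, lidar_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.camera_encoder(cam_feat),
                           self.lidar_encoder(lidar_feat)], dim=-1)
        return self.head(fused)

policy = MultimodalPolicy()
out = policy(torch.randn(1, 512), torch.randn(1, 256))
print(out.shape)                                    # torch.Size([1, 3])
```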

after another round of discussion, each algorithm has its own advantages, and it is still unclear which method can lead us to a fully autonomous driving future.

zhang hang (senior chief scientist, cruise):

i don't think there is any algorithm that is simple, scalable, and can reach the l4 standard. i think this algorithm itself does not exist. this field is a field that everyone should work together to promote. i am very optimistic that everyone will reach the same destination, although everyone will have slightly different deviations.

6. no solution?

no matter which algorithm is used, it will eventually face the long tail problem.

in the traditional rule-based model, writing a rule base requires a large team and a lot of effort, and it is difficult to cover all aspects. so with end-to-end, can the long-tail problem be solved?

minfa wang (former senior machine learning engineer at waymo):

it solves the common cases, but i think the long-tail problem will still exist.

minfa believes that the fault tolerance rate of the autonomous driving system is very low. if a black box system is to be used on l4, other safety mechanisms must be introduced, but this brings us back to the cost issue under the rule-based model.

the autonomous driving algorithm will first be practiced in a simulation system. can simulation training solve certain long-tail problems?

zhang hang (senior chief scientist, cruise):

currently, there is no good solution that can generate simulation data that can really help our real-world road performance.

minfa wang (former senior machine learning engineer at waymo):

in areas like autonomous driving or robotics, the environment is extremely complex. if you want to simulate, you are not only simulating yourself, but also how the car will move in the future. the main difficulty is that when the trajectory of your own car changes, you will affect the behavior of all the cars and people around you and change them as well.

how to simulate well and avoid distribution shift is still an open topic.

since virtual scenes cannot fully simulate the various possibilities in reality, does it mean that the industry currently has no way to solve the long-tail problem and can only rely on the long accumulation of experience?

anonymous interviewee (l4 engineer):

to some extent, yes, but you don't have to be perfect, right? humans are not perfect either, you just have to do better than them. humans also have their accident rates, you just have to do better than that.

hou xiaodi (former founder and ceo of tusimple, founder of bot.auto):

i think the long tail problem is actually a false proposition. i am very glad that you raised this issue.

in my opinion, long-tail problems are like what should i do if i see a crocodile? what should i do if i see an elephant? what should i do if i see a fixed-wing aircraft parked on the highway?

in fact, for many long-tail problems, we wrap them into a large category of problems. if you wrap them into a more general category of problems, it is easy to handle.

for example, we once saw a fixed-wing aircraft parked on a highway. our solution was very simple: just stop, right?

is the long tail problem a false proposition or a problem that needs to be solved? everyone may have their own answer to this question. the long tail problem corresponds to when l4 or even l5 can be widely rolled out, so next, let's take a look at the fierce conflict between l2 and l4.

4. will tesla robotaxi succeed? conflict between l2 and l4

1. “it won’t work”

we asked the guests for their opinions before musk announced the postponement of the release of robotaxi. everyone had a very unanimous view that it was impossible for tesla's driverless taxis to be launched this year.

the biggest reason why everyone has such a unanimous opinion is that tesla’s current models do not meet the l4 standard for driverless taxis.

liu bingyan (person in charge of kargo software):

i am very sure that the existing tesla models have very clear blind spots. if it wants to achieve the ultimate, whether it is l4 or l5 autonomous driving, its next car must solve this blind spot problem. and solving this blind spot problem goes back to what we just said, it must adjust the position of the camera sensor, and the immediate result of adjusting these positions is that the previous model will be completely invalid.

from the perspective of visual camera architecture, it is impossible for existing cars to achieve fsd that can be completely unmanned. from this perspective, it must have a new hardware.

zhang hang (senior chief scientist, cruise):

from the sensor perspective, it needs to introduce some redundancy, which may not be required in l2 before.

when industry insiders are pessimistic, what makes musk so confident about launching robotaxi?

yu zhenhua (former tesla ai engineer):

i think it’s mainly due to the technological breakthroughs of fsd v12. given musk’s personality, when he saw fsd v12 at this moment, he felt that robotaxi should be put on the agenda in his plan.

so, can fsd v12 allow tesla to move towards l4 and take on the responsibility of robotaxi? how big is the gap compared with the existing waymo or cruise?

when interviewing hou xiaodi about this issue, his answer showed us another point of view in the industry: that is, the gap between l2 and l4 is very large.

2. “very far away”

hou xiaodi (former founder and ceo of tusimple, founder of bot.auto):

first of all, what tesla is doing is not driverless. what we are talking about today is a solution that removes people and has the software development company take responsibility. this is called driverless. let's not do false propaganda. fsd is called assisted driving, it is not driverless, so they are not the same thing.

at present, l2 assisted driving is widely used by car companies such as tesla, xiaomi, huawei, and xiaopeng, while companies making driverless taxis, such as waymo, cruise, and baidu, use l4 highly automated driving. putting aside the written definitions, the essential difference between the two lies in who bears the responsibility.

hou xiaodi (former founder and ceo of tusimple, founder of bot.auto):

the solution that removes people and makes the software development company bear the responsibility is called autonomous driving. let me tell you a joke, what if tesla kills someone? for elon musk, it's not their business.

therefore, if tesla wants to make driverless taxis, it must take responsibility. so what are the technical differences between assisted driving and autonomous driving?

hou xiaodi (former founder and ceo of tusimple, founder of bot.auto):

what are the core issues that l4 autonomous driving needs to solve? safety and redundancy: even when any module of the system may fail, the system can still guarantee a baseline of safety. this is the hardest and most critical part of l4. before making money, it must first solve the safety problem, but that was never tesla's design goal.

another l4 autonomous driving researcher also analyzed the differences between l2 and l4 from the perspectives of hardware and software.

zhang hang (senior chief scientist, cruise):

for the l4 solution, first of all, we have relatively powerful sensors, which may be difficult to use in l2 scenarios, at least we won’t use such high-precision lidar.

from the algorithm perspective, l2 companies may focus more on some more efficient ways to reduce costs, and do not need particularly expensive sensors, and may achieve such an effect with less calculations. these l2 companies do not actually need to consider such one-in-a-million cases.

what we are pursuing in l4 is that human remote assistance is only needed once every million miles. we are pursuing this one-in-a-million case.

to summarize: the l4 solution uses sensors with higher accuracy and chips with more computing power, and can handle more comprehensive scenarios.

however, for the l2 solution, cost is the primary consideration, so the hardware spec will be somewhat lower. at the same time, in order to fit that lower-spec hardware, the algorithm will focus more on efficiency than on safety. as a result, the takeover frequency of l2 is much higher than that of l4.

so, can an l2 company like tesla achieve l4 results by improving its hardware and software?

3. “two different things”

hou xiaodi (former founder and ceo of tusimple, founder of bot.auto):

i do not support the route of l2 slowly evolving into l4 and l5. i think this is another false proposition with a strong extrapolation attribute.

could dolphins evolve into a civilization in the future? i think it is possible, but we have to realize that earth no longer leaves room for dolphins to evolve, because there are already companies that have gotten there. my company exists to get l4 deployed as quickly as possible. once i have landed it, there is nothing left for you to do, right? once homo sapiens picked up the javelin, no dolphin was going to build a civilization.

in hou xiaodi's opinion, existing l4 companies have already built up technical barriers, and under fierce competition there will be no opportunity for l2 to evolve. at the same time, some people believe this does not mean l4 technology is more advanced than l2, but rather that the two target different scenarios.

yu zhenhua (former tesla ai engineer):

if l4 is really superior to l2 as everyone imagines, and is absolutely technologically advanced and leading, then i would like to ask why l4 technology cannot be directly downgraded to l2?

in fact, in the past many years, l4 companies have been helping car manufacturers to develop l2 due to revenue pressure, but they cannot simply downgrade and basically have to redevelop it.

we also know that in the united states, gm (general motors) owns cruise l4, and ford owns argo ai, which is also an l4 company. why can't gm use cruise's technology in its mass-produced cars? why can't ford use argo ai's l4 technology in its mass-produced cars? so l4 is not absolutely more advanced than l2. in terms of technical difficulty, i don't think that if you do l4, you will appear to be very advanced.

why can’t l4 technology be directly downgraded to l2? zhang hang explained that since l4 uses higher-specification hardware, while l2’s algorithm must adapt to lower-specification sensors and processors with less computing power, the two technologies cannot be directly migrated.

it is like an architect whose computer is confiscated and who is given only a low-precision ruler, paper and pen: he has to adapt to a new way of drawing.

zhang hang (senior chief scientist, cruise):

what you just said is the problem of computing power. the l2 solution cannot support it. putting a supercomputer in the trunk of a car is an unrealistic solution.

at the same time, zhang hang also showed a more open mind in the comparison between l2 and l4 technologies. l2 has a wider coverage, needs to face more scenarios, and only needs to solve basic problems. l4 has a limited coverage, but pays more attention to various details. so each has its own advantages and disadvantages.

zhang hang (senior chief scientist, cruise):

l4 itself cannot be used as an l2 solution by simply simplifying the existing system and removing redundancy, but vice versa. if l2 wants to achieve the standard of l4, it will take a long time to hone, collect data for a long time, and then accumulate experience.

but i don’t think that our technical route or technical depth will be higher than l2. i don’t think this is necessarily the case. many l4 algorithms may not be very cutting-edge, but they are designed very carefully to solve some very detailed long-tail problems.

which view do you support? you can leave a message to tell us. in our interviews, different people will have their own answers to this question.

yu zhenhua (former tesla ai engineer):

i think that the general public, and even some l4 companies, like to instill the idea that l4 technology is better than l3, which is better than l2. i think this is misleading, a way of glossing over the restricted scenarios, because today's l4 robotaxis operate in very restricted scenarios and must be run within a specific area. for example, waymo can only operate in one area at a time.

shao xuhui (managing partner investor, foothill ventures):

personally, i am still optimistic about l4 companies, because logically speaking, l4 can attack downward from a higher dimension, but l2, if that is all you do, cannot move up, or it will be very, very difficult to move up.

anonymous interviewee (l4 engineer):

in fact, i don't think there is a particularly difficult threshold in the technology stack. for example, a company can claim to be an l2 company today, and maybe tomorrow it can add some new technologies and do l4, right? it all depends on what technology it uses in its applications, or what new technological breakthroughs it has, right?

hou xiaodi (former founder and ceo of tusimple, founder of bot.auto):

assisted driving and unmanned driving are two different things.

producers: hongjun, chen qian, author: wang ziqin, editor: chen qian