The commercial deployment of autonomous driving is accelerating globally.
As of May 2025, Waymo had 1,500 autonomous taxis operating in San Francisco, Los Angeles, Phoenix, and Austin in the United States, completing over 250,000 paid trips per week. Baidu Apollo had deployed over 1,000 driverless vehicles worldwide, providing over 11 million rides in total and covering a safe driving distance of over 170 million kilometers.
Large-scale deployment may seem to suggest that the technology is already mature, but that is not the case. There are still many divergent schools of thought on autonomous driving that have yet to reach a consensus.
For example, in terms of sensors, how should one choose between a pure vision solution and multi-sensor fusion? In system architecture, should one adopt a modular design or embrace the emerging end-to-end architecture? And when it comes to understanding the world, which is better, VLA or VLM?
These unresolved controversies are steering autonomous driving toward an uncertain future. Understanding these different technical routes means understanding where autonomous driving comes from, where it is going, and how the technology can evolve.
The Battle of the Eyes
Pure Vision vs. Multi-Sensor Fusion
Everything starts with "seeing". How a vehicle perceives the world is the cornerstone of autonomous driving. On this question, two camps have long stood opposed, and the debate continues.
The story can be traced back to a challenge held in the Mojave Desert in the United States in 2004.
At the time, the U.S. Defense Advanced Research Projects Agency offered a $2 million prize, attracting dozens of top universities and research institutions to compete, all trying to answer one question: how can a vehicle perceive its surrounding environment?
Lidar, the sensor chosen by the teams from Carnegie Mellon University and Stanford University, came out on top. This technology, which can generate precise 3D point cloud maps, laid the foundation for the early development of autonomous driving and was inherited and advanced by Waymo, a subsidiary of Google.
However, this school has a fatal weakness: cost. A lidar system cost as much as $75,000, more expensive than the car itself, which meant it could only follow a small-scale, elite route and was difficult to commercialize at scale.
Ten years later, the vision school, represented by Tesla, took a different path.
They advocate simplicity: "Humans can drive with just a pair of eyes and a brain. Why can't machines?"
In 2014, Tesla launched the Autopilot system, adopting Mobileye's vision solution and choosing a camera-centric approach. In 2016, Elon Musk publicly stated that "lidar is futile", formally establishing the pure vision route.
The team simulates the human field of view through eight surround cameras and relies on deep learning algorithms to reconstruct the 3D environment from 2D images. The pure vision solution is extremely cheap and can be commercialized at scale. By selling more cars and collecting more real-world data, a "data flywheel" is formed that feeds back into algorithm iteration, making the system stronger with use.
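To make the "reconstruct 3D from 2D" step concrete, here is a minimal sketch of the underlying geometry, assuming a simplified pinhole camera with known intrinsics and extrinsics; real systems learn depth with neural networks and fuse features across all eight cameras rather than lifting single pixels:

```python
# Minimal sketch: lift a 2D pixel into the vehicle's 3D frame, given an
# estimated depth. K and the extrinsics below are illustrative placeholders.
import numpy as np

def unproject(pixel_uv, depth, K, cam_to_vehicle):
    """Lift a pixel with an estimated depth into vehicle coordinates."""
    u, v = pixel_uv
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing ray (pinhole model)
    point_cam = ray * depth                          # scale ray by predicted depth
    point_hom = np.append(point_cam, 1.0)            # homogeneous coordinates
    return (cam_to_vehicle @ point_hom)[:3]          # transform into vehicle frame

K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])  # toy intrinsics
front_cam = np.eye(4)  # identity extrinsics for brevity
print(unproject((700, 400), depth=12.5, K=K, cam_to_vehicle=front_cam))
```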
However, cameras are "passive" sensors and depend heavily on ambient light. In conditions such as backlighting, glare, nighttime, heavy rain, or thick fog, their performance declines significantly, far below that of lidar.
The multi-sensor fusion camp, centered on lidar, believes that machine intelligence cannot fully match human common sense and experience-based intuition in the foreseeable future. In bad weather, hardware redundancy such as lidar must compensate for the shortcomings of the software.
One could say the pure vision solution concentrates all the pressure on the algorithm, betting on the future of intelligence, while multi-sensor fusion focuses on engineering pragmatism and chooses a proven real-world solution.
Currently, mainstream players (such as Waymo, XPeng, and NIO) are on the side of multi-sensor fusion. They believe that safety is an inviolable red line for autonomous driving, and redundancy is the only way to ensure it.
It is worth noting that the two routes are not entirely distinct but are learning from and integrating with each other. The pure vision camp is introducing more sensors, while in multi-sensor fusion the role of vision algorithms is becoming ever more important, emerging as the key to understanding scene semantics.
The Battle of Touch
Lidar vs. 4D Millimeter-Wave Radar
Even within the multi-sensor fusion camp, there is a choice to be made:
A millimeter-wave radar costs only a few hundred dollars, while early lidar cost tens of thousands. Why spend so much on lidar?
Lidar (light detection and ranging) constructs extremely detailed 3D point cloud images of the surrounding environment by emitting laser beams and measuring their return time, solving the fatal "corner cases" (extreme situations) that other sensors could not handle at the time.
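The ranging principle itself is straightforward time-of-flight. A minimal sketch (the 667 ns echo time is illustrative):

```python
# Time-of-flight ranging: the pulse travels out and back, so halve the path.
C = 299_792_458.0  # speed of light in m/s

def range_from_echo(round_trip_seconds: float) -> float:
    return C * round_trip_seconds / 2.0

print(f"{range_from_echo(667e-9):.1f} m")  # an echo after ~667 ns -> ~100 m
```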
It has extremely high angular resolution and can clearly distinguish the posture of pedestrians, the outline of vehicles, and even small obstacles on the road surface. In L4/L5 commercial autonomous driving, no other sensor can simultaneously meet the two requirements of "high precision" and "detecting static objects". To achieve the most basic autonomous driving functions and safety redundancy, the cost of lidar is a price automakers have to pay.
If lidar is already so powerful, why develop other sensors?
Lidar delivers extremely high performance but also has its limitations. Laser light is short-wavelength infrared light. Particles such as raindrops, fog droplets, snowflakes, and dust are comparable in scale to the laser wavelength, causing the laser to scatter and be absorbed and producing large amounts of "noisy" point clouds.
4D millimeter-wave radar works around the clock. In bad weather, its strong penetration lets it detect obstacles ahead first and provide distance and speed data. However, its echo points are very sparse, forming only a small number of point cloud points; it cannot outline the shape and contour of objects the way lidar can, and electronic interference can also produce "ghost" detections. Its low resolution means it can never be the primary sensor and can only serve as an auxiliary one.
Therefore, lidar and millimeter-wave radar each have their own strengths and weaknesses. They are not substitutes but complements, following the logic of "millimeter-wave radar to control costs in normal conditions, lidar to ensure safety in complex ones", with different vehicle models using different configurations.
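A minimal sketch of this complementary gating, with illustrative types and values (real fusion stacks weigh probabilistic tracks, not single detections):

```python
# Hedged sketch: radar supplies all-weather range/velocity; lidar is consulted
# for fine geometry when conditions allow. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class Detection:
    distance_m: float
    velocity_mps: float
    shape_confidence: float  # contour quality, 0..1 (lidar's strength)

def fuse(radar: Detection, lidar: Detection | None, heavy_weather: bool) -> Detection:
    if lidar is None or heavy_weather:
        # Fall back to radar: it penetrates rain and fog, but its sparse
        # echoes carry almost no shape information.
        return Detection(radar.distance_m, radar.velocity_mps, shape_confidence=0.1)
    # Clear conditions: take lidar's geometry, keep radar's direct velocity.
    return Detection(lidar.distance_m, radar.velocity_mps, lidar.shape_confidence)

print(fuse(Detection(42.0, -3.1, 0.0), Detection(41.6, 0.0, 0.9), heavy_weather=False))
```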
L4 robotaxis and luxury cars usually adopt the strategy of "lidar first, millimeter-wave radar as a supplement", stacking sensors regardless of cost to pursue the ultimate ceiling of safety and performance. L2+ and L3 mass-produced economy cars rely mainly on "cameras + millimeter-wave radar" and add 1-2 lidars at key positions such as the roof to form a cost-effective solution.
The debate among automakers over sensor selection is essentially a technical exploration and business game of "how to achieve the highest safety at the lowest cost". In the future, the various sensors will be further integrated into diverse, matched configurations.
The Battle of the Brain
End-to-End vs. Modular
If sensors are the eyes, then algorithms are the brain.
For a long time, autonomous driving systems followed a modular design. The entire driving task is broken down into independent subtasks such as perception, prediction, planning, and control. Each module has its own responsibilities, with independent algorithms and optimization targets, like a well-defined assembly line.
The advantages of the modular design are strong interpretability, parallel development, and easy debugging. However, local optimization does not equal global optimization, and the divide-and-conquer model has fatal flaws. As each module processes and passes on information, it simplifies and abstracts to some degree, so the original rich information is lost layer by layer, making overall optimal performance hard to achieve.
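To make those lossy handoffs concrete, here is a minimal sketch of the modular pipeline; every module body is an illustrative stub, but the narrow interfaces between them are the point:

```python
# Each stage compresses the scene into a narrower representation: pixels
# become boxes, boxes become trajectories, trajectories become one speed.
def perceive(raw_sensors):   # perception: raw data -> list of detected objects
    return [{"id": 1, "pos": (25.0, 1.2), "kind": "pedestrian"}]

def predict(objects):        # prediction: objects -> future trajectories
    return [{"id": o["id"], "future": [(o["pos"][0], o["pos"][1] + 0.5)]}
            for o in objects]

def plan(trajectories):      # planning: trajectories -> target behavior
    return {"target_speed": 3.0 if trajectories else 13.9}

def control(planned):        # control: behavior -> actuator commands
    return {"throttle": 0.1, "brake": 0.0, "steer": 0.0}

print(control(plan(predict(perceive(raw_sensors=None)))))
```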
From 2022 to 2023, the "end-to-end" model, represented by Tesla FSD V12, emerged and disrupted the traditional paradigm. The inspiration comes from the way humans learn: novice drivers do not first study optical principles and then memorize traffic rules; they learn to drive directly by observing the instructor's actions.
The end-to-end model no longer makes artificial module divisions. Instead, by learning from a huge amount of human driving data, it builds one large neural network that directly maps raw sensor input to terminal driving commands such as steering wheel angle, accelerator, and brake.
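A minimal PyTorch sketch of that mapping, assuming a single front camera and illustrative layer sizes; a production network consumes multi-camera video and is orders of magnitude larger:

```python
import torch
import torch.nn as nn

class EndToEndDriver(nn.Module):
    """One network from raw pixels to [steer, throttle, brake]."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(             # raw pixels -> feature vector
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 3)               # features -> control commands

    def forward(self, frames):
        return self.head(self.backbone(frames))

controls = EndToEndDriver()(torch.randn(1, 3, 224, 224))  # one RGB frame
print(controls.shape)  # torch.Size([1, 3])
```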
Unlike the modular approach, the end-to-end model loses no information along the way, has a high performance ceiling, and can further simplify the development process. However, it suffers from the black-box problem of being hard to trace: once an accident occurs, it is difficult to determine which step went wrong and how to optimize it afterwards.
The emergence of the end-to-end model has shifted autonomous driving from rule-driven to data-driven. However, its black-box nature has deterred many automakers that put safety first, and only companies with large fleets can supply the massive training data required.
As a result, a compromise "explicit end-to-end" solution has emerged in the industry, which retains intermediate outputs such as drivable areas and target trajectories inside the end-to-end model, seeking a balance between performance and interpretability.
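A hedged sketch of what "retaining intermediate outputs" can look like, with illustrative heads and shapes; real systems attach far richer perception and planning heads:

```python
import torch
import torch.nn as nn

class ExplicitEndToEnd(nn.Module):
    """End-to-end controls plus inspectable intermediate outputs."""
    def __init__(self, feat=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, feat, 7, stride=8), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.drivable_area = nn.Linear(feat, 64 * 64)  # coarse bird's-eye mask
        self.trajectory = nn.Linear(feat, 10 * 2)      # ten future (x, y) waypoints
        self.controls = nn.Linear(feat + 10 * 2, 3)    # controls see the trajectory

    def forward(self, frames):
        z = self.backbone(frames)
        area = self.drivable_area(z)   # debuggable: did it see the road correctly?
        traj = self.trajectory(z)      # debuggable: where did it intend to go?
        ctrl = self.controls(torch.cat([z, traj], dim=1))
        return area, traj, ctrl

area, traj, ctrl = ExplicitEndToEnd()(torch.randn(1, 3, 224, 224))
print(area.shape, traj.shape, ctrl.shape)
```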
The Battle of the “Soul”
VLM vs. VLA
As AI develops, a new battlefield has opened up around large models, and it concerns the soul of autonomous driving: should it be a thinker (VLM) that assists driving or an executor (VLA) that drives?
The VLM (vision-language model) camp believes in collaboration and pursues controllable processes; it is also known as the enhancement school. This route holds that although large AI models are powerful, hallucinations are fatal in a safety domain. The models should do what they are best at (understanding, explaining, reasoning), while the final decision-making power stays with the traditional autonomous driving modules that have been verified for decades and are predictable and tunable.
The VLA (vision-language-action model) camp believes in emergence and pursues the optimal outcome; it is known as the ultimate form of end-to-end. This school argues that as long as the model is large enough and there is enough data, AI can learn all the details and rules of driving from scratch, and its driving ability will ultimately surpass that of humans and rule-based systems.
The debate between VLM and VLA is in many ways a continuation of the debate between the modular and end-to-end approaches.
VLA faces the black-box dilemma of being hard to trace. If a VLA car suddenly brakes hard, engineers can hardly trace the cause. Did it misjudge a shadow as a pothole? Or did it learn a bad habit from a human driver? It cannot be debugged or verified step by step, which fundamentally conflicts with the strict functional safety standards of the automotive industry.
A VLM system, by contrast, can be decomposed, analyzed, and optimized throughout the process. If there is a problem, engineers can see clearly that the traditional perception module detected an object, the VLM identified it as "a plastic bag blown by the wind", and the planning module decided "no emergency braking needed, just slow down slightly". In case of an accident, responsibility can be clearly assigned.
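A minimal sketch of this advisor pattern; every function below is an illustrative stand-in for the real module:

```python
def perception_module(frame):                 # traditional stack detects something
    return {"object": "unknown_obstacle", "distance_m": 18.0}

def vlm_describe(frame, detection):           # stand-in for a real VLM query
    return "a plastic bag blown by the wind"

def rule_based_planner(detection, semantic_hint):
    # The VLM only advises; the verified rule-based planner keeps authority.
    if "plastic bag" in semantic_hint:
        return {"action": "slow_slightly"}    # no emergency braking needed
    if detection["distance_m"] < 20.0:
        return {"action": "brake"}            # conservative default rule
    return {"action": "continue"}

frame = None                                  # placeholder camera frame
det = perception_module(frame)
print(rule_based_planner(det, vlm_describe(frame, det)))
```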
Besides the polarization in interpretability, training cost is another reason automakers hesitate.
VLA requires a huge amount of paired "video - control signal" data, that is, input an 8-camera video stream and output the synchronized steering wheel, accelerator, and brake signals. Such data is extremely scarce and expensive to produce.
VLM is essentially a multi-modal large model that can be pre-trained on the abundant Internet-scale "image - text" paired data and then fine-tuned with driving-related data. The data sources are broader and the cost is comparatively lower.
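The contrast in training data can be made concrete with two illustrative sample records; the field names are assumptions, not any vendor's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VLASample:              # scarce and expensive: synchronized video + actuation
    camera_clips: List[str]   # e.g. paths to 8 time-aligned camera videos
    steering: List[float]     # per-frame steering wheel angle
    throttle: List[float]
    brake: List[float]

@dataclass
class VLMSample:              # web-scale and comparatively cheap
    image: str                # path or URL to a single image
    caption: str              # free-form descriptive text

vla = VLASample(["cam0.mp4"] * 8, [0.02, 0.03], [0.10, 0.10], [0.0, 0.0])
vlm = VLMSample("street.jpg", "a cyclist waiting at a red light")
print(len(vla.camera_clips), "camera clips;", vlm.caption)
```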
At present, VLM technology is relatively mature and easier to deploy. Most mainstream automakers and autonomous driving companies (including Waymo, Cruise, Huawei, XPeng, etc.) are on the VLM route. The explorers of the VLA route are represented by Tesla, Geely, and Li Auto. Reportedly, the Qianli Haohan H9 solution from Geely's Qianli Technology uses a VLA large model, which offers stronger reasoning and decision-making abilities and supports L3-level intelligent driving features.
Looking back at the debates among the different schools of autonomous driving, we find that these technological contests have never ended with one side winning outright. Instead, the schools are merging through confrontation and moving toward a higher-level unity: lidar and vision are being integrated into a multi-modal perception system; the modular architecture is starting to absorb the advantages of the end-to-end model; and large models are injecting cognitive intelligence into all of these systems.