Distributed autonomy · general-purpose models

A car that drives on words, not pixels.

Driving is split across a team of general-purpose models. Eight surround cameras each run their own vision-language model and describe what they see in plain language. A triage agent decides which views are worth reading this instant, and a commander model distills that focused subset into a single driving action — steering, throttle, brake. No bespoke driving network; just off-the-shelf models composed over a language bus.

surround cameras, each its own vision-language model

triage agent routing attention across the array

commander distilling the driving action

The loop

A team of models talks its way down the road.

Our system replaces the usual perception-tensor pipeline with a language bus. Eight surround cameras are each wrapped by a general-purpose vision-language model that can emit a short, structured utterance on demand. A triage agent queries only the views that matter this tick — keeping the array cheap to run — and the commander distills that focused set of utterances into control commands. The whole exchange is human-readable.
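One tick of the loop described above can be sketched as follows. This is a minimal illustration, not the project's code: `Utterance`, `DrivingAction`, `run_tick`, `stub_triage`, and `stub_commander` are hypothetical names, the narrators are canned strings standing in for vision-language models, and the choice of which three cameras the stub triage selects is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Utterance:
    camera: str
    text: str


@dataclass(frozen=True)
class DrivingAction:
    steering_deg: float
    throttle_pct: float
    brake: str


# A narrator stands in for one camera's vision-language model (assumption:
# any zero-argument callable that returns a short description).
Narrator = Callable[[], str]


def run_tick(
    narrators: Dict[str, Narrator],
    triage: Callable[[List[str]], List[str]],
    commander: Callable[[List[Utterance]], DrivingAction],
) -> Tuple[List[Utterance], DrivingAction]:
    """Query only the cameras triage selects, then hand the text to the commander."""
    selected = triage(list(narrators))
    transcript = [Utterance(cam, narrators[cam]()) for cam in selected]
    return transcript, commander(transcript)


# Canned stand-ins so the sketch runs without any model behind it.
narrators = {
    "Front Main": lambda: "Open lane, gentle left bend in ~40 m. Lead vehicle braking.",
    "Front Wide": lambda: "Cyclist entering frame from the right shoulder.",
    "Left B-Pillar": lambda: "Adjacent lane clear back to ~30 m.",
    "Rear": lambda: "Nothing closing from behind. Safe to slow.",
}


def stub_triage(cameras: List[str]) -> List[str]:
    # Arbitrary selection for illustration; a real triage agent would be a model.
    wanted = {"Front Main", "Front Wide", "Rear"}
    return [cam for cam in cameras if cam in wanted]


def stub_commander(transcript: List[Utterance]) -> DrivingAction:
    # Fixed output for illustration; a real commander would reason over the text.
    return DrivingAction(steering_deg=-6.0, throttle_pct=22.0, brake="light")


transcript, action = run_tick(narrators, stub_triage, stub_commander)
for utterance in transcript:
    print(f"{utterance.camera}: {utterance.text}")
print(action)
```

The returned transcript is the whole input the commander saw, which is what makes a tick replayable after the fact.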

Surround array · 8 cameras

Front Main

“Open lane, gentle left bend in ~40 m. Lead vehicle braking.”

Front Wide

“Cyclist entering frame from the right shoulder.”

Front Narrow

“Far signal green, ~180 m. No cross traffic yet.”

Left B-Pillar

“Adjacent lane clear back to ~30 m.”

Right B-Pillar

“Parked van at the curb. Gap is clear to pass.”

Left Repeater

“Cyclist tracking the shoulder, holding a steady line.”

Right Repeater

“Curb edge sharpening — lane narrows past the van.”

Rear

“Nothing closing from behind. Safe to slow.”

Triage agent

Queries only the views that matter this tick

8 cams → 3 relevant

Commander

General-purpose reasoning model

distills → driving action

Driving action · commander output, per tick

Steering

−6°

Throttle

22%

Brake

light

Legend: queried by triage this tick · idle channel · command emitted

Motivation

Distributed intelligence, built from general models.

Our system splits driving across a team of off-the-shelf models that coordinate in natural language. That architecture trades a little bandwidth for composability, interpretability, and reach that a single end-to-end pixel policy struggles to offer.

Distributed by design

No single model has to do everything. Perception is spread across independent narrators that each own one viewpoint, and a commander coordinates them — intelligence that emerges from the team, not from one monolithic network.

Built from general-purpose models

No bespoke driving net. Off-the-shelf vision-language models narrate; a general reasoning model commands. The demonstrator probes how far composition of general-purpose models can substitute for a specialized stack.

Triage keeps the array affordable

Reading all eight vision-language models every tick would be slow and costly. A triage agent routes attention — querying only the cameras that matter for the current maneuver — so the team scales without the per-frame bill scaling with it.

Every decision has a transcript

Because the commander acts on what the cameras said, each maneuver is backed by a readable rationale — and language is the universal format, so adding a sensor means adding a narrator, not retraining a fusion backbone. Post-hoc, you replay the dialogue instead of probing a latent vector.

See it run

Watch the team talk its way down the road.

The recording shows the live camera feeds, the utterance stream, and the control output side by side — so you can read exactly what the commander knew at the moment it turned the wheel.

What we found

Generic models can understand driving — but not execute it.

We evaluated the system on NAVSIM — 1,000 real nuPlan scenes scored by the official PDMS metric — and optimized the prompts with automated evolution (GEPA). The headline isn't the score we reached; it's what the scores reveal about where language models hit their limit for continuous control.

43.5→54.1%

PDMS after GEPA prompt evolution — a +10.6pp gain, consistent across Gemini and GPT models.

67%

of off-road failures had the right direction, wrong magnitude — the bottleneck is kinematics, not comprehension.

~54%

a hard ceiling: a newer model and HD-map context moved it by roughly zero, and direct waypoints scored lower (47.1).

01 · Optimization

GEPA evolves the prompt against the real scorer and lifts every model by 6 to 11 pp.

GEPA is a DSPy optimizer that runs the pipeline on training scenes, collects the failures, and lets a model reflect on them to mutate the prompt — scored by the official PDMS metric, not a proxy. The gain is consistent and even transfers across model families.

Model	Baseline	+ GEPA	Δ
Gemini 2.5 Flash	43.5	54.1	+10.6
GPT-4o-mini	39.1	50.1	+11.0
Gemini 3.5 Flash (transfer)	47.6	53.9	+6.3

What it actually learned was a set of kinematic boundaries the model doesn't inherently know, mined from 450 training failures:

Above 8 m/s, keep steer within [-0.3, 0.3]
Never steer > 0.5 unless the nav command is a turn
Vehicle within 15 m ahead → brake = True
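For illustration, the three rules can be written as a hard post-filter on a command. Note the difference from what is described above: GEPA put these rules into the prompt text, whereas this sketch enforces them in code. `Command` and `apply_learned_bounds` are hypothetical names, the steering unit is assumed to be whatever unit the rules use, "steer > 0.5" is read as a bound on magnitude, and "within 15 m" is read as inclusive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Command:
    steer: float  # same (unspecified) unit as the learned rules
    brake: bool


def _clamp(value: float, limit: float) -> float:
    """Clamp value to the symmetric interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def apply_learned_bounds(
    cmd: Command,
    speed_mps: float,
    nav_is_turn: bool,
    lead_distance_m: Optional[float] = None,
) -> Command:
    steer, brake = cmd.steer, cmd.brake

    # Rule: never steer beyond 0.5 unless the nav command is a turn.
    if not nav_is_turn:
        steer = _clamp(steer, 0.5)

    # Rule: above 8 m/s, keep steer within [-0.3, 0.3].
    if speed_mps > 8.0:
        steer = _clamp(steer, 0.3)

    # Rule: vehicle within 15 m ahead -> brake.
    if lead_distance_m is not None and lead_distance_m <= 15.0:
        brake = True

    return Command(steer=steer, brake=brake)


# A fast, over-steered command with a close lead vehicle gets clamped and braked.
print(apply_learned_bounds(Command(0.8, False), speed_mps=12.0,
                           nav_is_turn=False, lead_distance_m=10.0))
```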

The lift is mostly one gate: drivable-area compliance climbed 65 → 80%. It's a multiplicative gate, so staying on the road has outsized impact on the score.
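To see why a multiplicative gate dominates, here is a small sketch of the per-scene score as we understand NAVSIM's published PDMS definition: two multiplicative terms (no at-fault collision, drivable-area compliance) times a weighted average of progress, time-to-collision, and comfort with weights 5, 5, and 2. Treat the exact weights as an assumption to check against the benchmark's documentation; `pdm_score` is a hypothetical helper name.

```python
def pdm_score(nc: float, dac: float, ep: float, ttc: float, comfort: float) -> float:
    """Per-scene score: multiplicative gates times a weighted average.

    nc      -- no at-fault collision (gate)
    dac     -- drivable-area compliance (gate)
    ep      -- ego progress
    ttc     -- time-to-collision within bound
    comfort -- comfort
    """
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    return nc * dac * weighted


# A scene that is perfect on every other term still scores zero off-road.
print(pdm_score(nc=1.0, dac=0.0, ep=1.0, ttc=1.0, comfort=1.0))

# The same scene on the road: halving progress costs far less than leaving the road.
print(pdm_score(nc=1.0, dac=1.0, ep=0.5, ttc=1.0, comfort=1.0))
```

If drivable-area compliance is pass/fail per scene, the dataset mean can never exceed the pass rate, so moving that rate from 65% to 80% raises the ceiling on everything else.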

02 · Comprehension

The models understand the scene — they just can't name the number.

On 243 scenes where the system went off-road, an independent judge compared the system's stated reasoning against the ground-truth direction. Two out of three times, the intent was right; only the magnitude was wrong.

67%

Correct direction — knew “turn left,” missed “by 0.15 rad at 12 m/s.”

12%

Ambiguous — direction genuinely unclear from the scene.

21%

Wrong direction — a true comprehension miss.

There's a paradox hiding in the optimization: the GEPA-tuned models post a higher PDMS but a lower decision accuracy.

Baseline (Gemini 2.5F): 46% of decisions right, stops in 25% of scenes
+ GEPA (Gemini 2.5F): 34% of decisions right, stops in 42% of scenes
+ GEPA (GPT-4o-mini): 26% of decisions right, stops in 60% of scenes

The optimizer learned to stop more often because the metric rewards caution — competence and score quietly point in different directions.

03 · The ceiling

Even handed the HD map, it still can't steer.

The most telling experiment: feed the model exact lane curvature, width, and heading as text, the same privileged information a route-following algorithm uses to score 68.7. The score did not move.

HD-map context as text: 53.1 · no improvement vs. 54.1

Direct waypoints, skip physics: 47.1 · worse

Newer model (Gemini 3.5F): 53.9 · no improvement

The model understands “the road curves left.” It cannot translate that into the right numeric steering value — and no amount of extra context closes that gap. The frontier isn't information or model size; it's spatial numeric precision in continuous control. The path forward likely removes the language model from the precision loop entirely: coarse intent from the model, a classical controller for the exact numbers.
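One way that split could look, as a sketch only: the language model emits a coarse intent label, and a classical pure-pursuit controller turns the chosen path into a steering angle. Nothing here is from the project. The intent labels, `steer_from_intent`, the candidate paths, the 2.8 m wheelbase, and the 10 m lookahead are all illustrative assumptions. Points are in the vehicle frame with x forward and y to the left.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x forward, y left) in metres, vehicle frame


def pure_pursuit_steer(target: Point, wheelbase_m: float = 2.8) -> float:
    """Steering angle (rad) that puts a bicycle model on an arc through target."""
    x, y = target
    dist_sq = x * x + y * y
    if dist_sq == 0.0:
        raise ValueError("lookahead target coincides with the vehicle")
    curvature = 2.0 * y / dist_sq  # standard pure-pursuit arc curvature
    return math.atan(wheelbase_m * curvature)


def steer_from_intent(
    intent: str,
    candidate_paths: Dict[str, List[Point]],
    lookahead_m: float = 10.0,
    wheelbase_m: float = 2.8,
) -> float:
    """The model only picks `intent`; geometry decides the number."""
    path = candidate_paths[intent]
    if not path:
        raise ValueError(f"no path points for intent {intent!r}")
    # Use the first point at least lookahead_m away, else the last point.
    target = next((p for p in path if math.hypot(*p) >= lookahead_m), path[-1])
    return pure_pursuit_steer(target, wheelbase_m)


paths = {
    "keep_lane": [(5.0, 0.0), (10.0, 0.0), (15.0, 0.0)],
    "turn_left": [(5.0, 0.5), (10.0, 2.0), (15.0, 4.5)],
    "turn_right": [(5.0, -0.5), (10.0, -2.0), (15.0, -4.5)],
}
for intent in paths:
    print(intent, round(steer_from_intent(intent, paths), 4))
```

The language model's job shrinks to a three-way choice, which is the kind of decision the comprehension results suggest it gets right more often than the magnitude.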

Trajectory planner experiment

What if the planner generates trajectories — and the language model just picks?

We paired a kinematic diffusion model (6.7M params, 128 trajectory proposals per scene) with evolutionary search scoring. On the same 1,000 NAVSIM scenes, this planning-first approach reaches 80.2% EPDMS — but adding the language-model commander drops it back to 70.4%.

80.2%

EPDMS with the standalone trajectory planner + heading fix, the strongest result on our 1,000-scene split.

−10pp

Performance drops when the language-model commander joins — the "deceleration trap" actively harms the planner.

82%

of commander actions are DECELERATE — safety alignment makes the language model reflexively cautious.

ES-01 · The deceleration trap

Safety-aligned models say "slow down" even when the road is clear.

Gemini 2.5 Flash, faced with any camera showing vehicles or pedestrians, defaults to DECELERATE. With the exponential weighting (ω_f = 5.0), a trajectory that disagrees with the commander gets penalized by 97% — regardless of its actual driving quality.

Method	EPDMS	Δ vs standalone planner
Always stop (lower bound)	62.2	−18.0
Random single trajectory	67.2	−13.0
Commander + planner (GEPA-optimized)	70.4	−9.8
Standalone planner + heading fix	80.2	—

The self-reinforcing failure: GEPA optimizes against ω_f = 5.0 scoring — so it learns that DECELERATE is "safe" (never causes collisions) and reinforces the bias rather than correcting it.

ES-02 · The scorer gap

Language-model guidance only helps when the planner's scorer is an imperfect proxy for reality.

In NAVSIM's open-loop setting, the scoring function at search time equals the final evaluation — other agents follow recorded trajectories regardless of ego. Any commander signal can only pull selection away from the optimum. Even oracle commanders with 100% accuracy cannot beat the standalone planner.

Method	EPDMS	Δ vs standalone planner
Language-model commander (optimized prompts)	70.4	−9.8
Oracle lateral commander (GT direction)	77.1	−1.6
Oracle speed commander (GT action)	78.4	−2.3
Standalone planner (heading fix, 1000 scenes)	80.2	—

nuPlan: where commanders help

In closed-loop nuPlan (8s horizon, reactive agents), the scorer mispredicts — it gives 0.95 to trajectories that actually score 0.0 when agents react. A commander fills this gap. In NAVSIM, there is no gap to fill.
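The argument can be checked with a toy selection problem. This is an illustrative sketch with made-up trajectory scores, except for the 0.95 versus 0.0 misprediction quoted above and a 0.03 guidance weight standing in for a 97% penalty; `select` is a hypothetical helper, not the planner's actual search.

```python
from typing import List, Optional


def select(search_scores: List[float], guidance: Optional[List[float]] = None) -> int:
    """Index of the trajectory with the best (optionally reweighted) search score."""
    if guidance is None:
        guidance = [1.0] * len(search_scores)
    weighted = [s * g for s, g in zip(search_scores, guidance)]
    return max(range(len(weighted)), key=weighted.__getitem__)


# Commander disagrees with trajectory 0 and down-weights it heavily.
guidance = [0.03, 1.0, 1.0]

# No gap (open loop): the search-time score IS the evaluation score.
true_open = [0.9, 0.7, 0.4]
unguided_open = true_open[select(true_open)]
guided_open = true_open[select(true_open, guidance)]
print("open loop  :", unguided_open, "->", guided_open)

# Gap (closed loop): the scorer gives 0.95 to a trajectory that really scores 0.0.
predicted_closed = [0.95, 0.7, 0.4]
true_closed = [0.0, 0.7, 0.4]
unguided_closed = true_closed[select(predicted_closed)]
guided_closed = true_closed[select(predicted_closed, guidance)]
print("closed loop:", unguided_closed, "->", guided_closed)
```

With no gap, the unguided pick is already the maximum of the evaluation score, so any reweighting can only tie or lose. With a gap, the same guidance steers selection away from the mispredicted trajectory and the realized score improves.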

ES-03 · Implication

Language-model commanders need a scorer gap to justify their presence.

Three experiments converge on the same architecture principle.

Scorer gap = commander value

NAVSIM (no gap): Oracle commander still loses to the standalone planner

nuPlan (large gap): Commander fills misprediction, adds value

Real-world deployment: Large gap → commander adds value

The question isn't whether language models can drive — it's where the scorer fails. In open-loop benchmarks the planner's local score is already truth; any guidance is noise. In closed-loop and real traffic, agents react unpredictably, creating the scorer gap that makes a commander essential. The architecture that wins deploys language-model guidance only where the planner's local objective diverges from reality.