Distributed by design
No single model has to do everything. Perception is spread across independent narrators that each own one viewpoint, and a commander coordinates them — intelligence that emerges from the team, not from one monolithic network.
Driving is split across a team of general-purpose models. Eight surround cameras each run their own vision-language model and describe what they see in plain language. A triage agent decides which views are worth reading this instant, and a commander model distills that focused subset into a single driving action — steering, throttle, brake. No bespoke driving network; just off-the-shelf models composed over a language bus.
Our system replaces the usual perception-tensor pipeline with a language bus. Eight surround cameras are each wrapped by a general-purpose vision-language model that can emit a short, structured utterance on demand. A triage agent queries only the views that matter this tick — keeping the array cheap to run — and the commander distills that focused set of utterances into control commands. The whole exchange is human-readable.
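The data flow described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the names `Utterance`, `Control`, `run_tick`, the camera labels, and the keyword-based stand-ins for the narrator, triage, and commander models are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Utterance:
    camera: str  # which view produced the text
    text: str    # short plain-language description


@dataclass(frozen=True)
class Control:
    steering: float  # hypothetical convention: negative = left
    throttle: float
    brake: bool


Narrator = Callable[[], str]
Triage = Callable[[str, List[str]], List[str]]
Commander = Callable[[List[Utterance]], Control]


def run_tick(narrators: Dict[str, Narrator], triage: Triage,
             commander: Commander, maneuver: str) -> Tuple[Control, List[Utterance]]:
    """One tick: pick views, query only those narrators, then command."""
    selected = triage(maneuver, list(narrators))
    utterances = [Utterance(cam, narrators[cam]()) for cam in selected]
    return commander(utterances), utterances


# Stand-ins for the real models, used only to exercise the data flow.
calls: List[str] = []


def make_narrator(name: str, text: str) -> Narrator:
    def narrate() -> str:
        calls.append(name)  # record which cameras were actually queried
        return text
    return narrate


narrators = {
    "front": make_narrator("front", "lane clear ahead"),
    "front_left": make_narrator("front_left", "cyclist approaching"),
    "rear": make_narrator("rear", "no traffic behind"),
}


def triage(maneuver: str, cameras: List[str]) -> List[str]:
    # Hypothetical routing table: which views matter for which maneuver.
    wanted = {"turn_left": ["front", "front_left"]}.get(maneuver, ["front"])
    return [cam for cam in wanted if cam in cameras]


def commander(utterances: List[Utterance]) -> Control:
    # Toy rule in place of a reasoning model: brake if any view reports a cyclist.
    hazard = any("cyclist" in u.text for u in utterances)
    return Control(steering=0.0, throttle=0.0 if hazard else 0.3, brake=hazard)


control, transcript = run_tick(narrators, triage, commander, "turn_left")
```

Because the rear narrator is never called for this maneuver, its cost is never paid, and `transcript` is the readable record of what the commander saw.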
Our system splits driving across a team of off-the-shelf models that coordinate in natural language. That architecture trades a little bandwidth for composability, interpretability, and reach that a single end-to-end pixel policy struggles to offer.
No bespoke driving net. Off-the-shelf vision-language models narrate; a general reasoning model commands. The demonstrator probes how far composition of general-purpose models can substitute for a specialized stack.
Reading all eight vision-language models every tick would be slow and costly. A triage agent routes attention — querying only the cameras that matter for the current maneuver — so the team scales without the per-frame bill scaling with it.
Because the commander acts on what the cameras said, each maneuver is backed by a readable rationale — and language is the universal format, so adding a sensor means adding a narrator, not retraining a fusion backbone. Post-hoc, you replay the dialogue instead of probing a latent vector.
The recording shows the live camera feeds, the utterance stream, and the control output side by side — so you can read exactly what the commander knew at the moment it turned the wheel.
We evaluated the system on NAVSIM — 1,000 real nuPlan scenes scored by the official PDMS metric — and optimized the prompts with automated evolution (GEPA). The headline isn't the score we reached; it's what the scores reveal about where language models hit their limit for continuous control.
GEPA is a DSPy optimizer that runs the pipeline on training scenes, collects the failures, and lets a model reflect on them to mutate the prompt — scored by the official PDMS metric, not a proxy. The gain is consistent and even transfers across model families.
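The run, collect failures, reflect, mutate cycle can be sketched as a plain loop. This is a greedy simplification for illustration only: it does not use the DSPy API, it keeps a single best prompt rather than a population of candidates, and `evolve_prompt`, `score`, `reflect`, and the toy scenes are hypothetical stand-ins for the pipeline, the PDMS scorer, and the reflecting model.

```python
from collections import Counter
from typing import Callable, List, Tuple


def evolve_prompt(seed: str, scenes: List[str],
                  score: Callable[[str, str], float],
                  reflect: Callable[[str, List[str]], str],
                  rounds: int = 3, fail_below: float = 0.5) -> Tuple[str, float]:
    """Keep a mutated prompt only if it raises the mean score on the scenes."""
    def mean_score(prompt: str) -> float:
        return sum(score(prompt, s) for s in scenes) / len(scenes)

    best, best_score = seed, mean_score(seed)
    for _ in range(rounds):
        failures = [s for s in scenes if score(best, s) < fail_below]
        if not failures:
            break  # nothing left to learn from
        candidate = reflect(best, failures)
        candidate_score = mean_score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins: a scene "passes" when the prompt already states its rule.
scenes = ["stay in lane", "stay in lane", "brake for pedestrians"]


def score(prompt: str, scene: str) -> float:
    return 1.0 if scene in prompt else 0.0


def reflect(prompt: str, failures: List[str]) -> str:
    # Add a rule for the most frequent failure, as a reflecting model might.
    most_common_failure = Counter(failures).most_common(1)[0][0]
    return prompt + "\nRule: " + most_common_failure


best, best_score = evolve_prompt("You are a driving commander.", scenes, score, reflect)
```

In the toy run the prompt picks up one rule per round, most frequent failure first, which mirrors how boundary rules would accumulate from failed scenes.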
| Model | Baseline | + GEPA | Δ |
|---|---|---|---|
| Gemini 2.5 Flash | 43.5 | 54.1 | +10.6 |
| GPT-4o-mini | 39.1 | 50.1 | +11.0 |
| Gemini 3.5 Flash (transfer) | 47.6 | 53.9 | +6.3 |
What it actually learned was a set of kinematic boundaries the model doesn't inherently know, mined from 450 training failures:

- [-0.3, 0.3]
- > 0.5 unless the nav command is a turn
- brake = True

The lift is mostly one gate: drivable-area compliance climbed from 65% to 80%. Because that term multiplies the rest of the score, staying on the road has an outsized impact.
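To see why a multiplicative gate dominates, consider a simplified scene score of the form gate × quality, where the gate is 1 on the road and 0 off it. The sketch below is not the official PDMS implementation, and the per-scene quality of 0.7 is a made-up constant chosen only to show the mechanism.

```python
def scene_score(on_road: bool, soft_quality: float) -> float:
    """Simplified score: a binary gate multiplied by everything else."""
    gate = 1.0 if on_road else 0.0
    return gate * soft_quality


def mean_score(compliance: float, soft_quality: float, n_scenes: int = 1000) -> float:
    """Mean score (0-100) when a fraction `compliance` of scenes stays on the road."""
    n_on_road = round(compliance * n_scenes)
    scores = [scene_score(i < n_on_road, soft_quality) for i in range(n_scenes)]
    return 100.0 * sum(scores) / n_scenes


# Hypothetical quality of 0.7 for every scene; only the gate changes.
before = mean_score(0.65, soft_quality=0.7)
after = mean_score(0.80, soft_quality=0.7)
gain = after - before
```

Every off-road scene contributes zero regardless of how well the rest of the maneuver went, so under this simplification the mean score scales linearly with compliance and no other term can compensate.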
On 243 scenes where the system went off-road, an independent judge compared the system's stated reasoning against the ground-truth direction. Two out of three times, the intent was right; only the magnitude was wrong.
There's a paradox hiding in the optimization: the GEPA-tuned models post a higher PDMS but a lower decision accuracy.
The optimizer learned to stop more often because the metric rewards caution — competence and score quietly point in different directions.
In the most telling experiment, we fed the model exact lane curvature, width, and heading as text: the same privileged information a route-following algorithm uses to score 68.7. The score did not move.
The model understands “the road curves left.” It cannot translate that into the right numeric steering value — and no amount of extra context closes that gap. The frontier isn't information or model size; it's spatial numeric precision in continuous control. The path forward likely removes the language model from the precision loop entirely: coarse intent from the model, a classical controller for the exact numbers.
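One way to picture that split is a language model that emits only a discrete intent while a classical controller computes the numbers. The sketch below assumes a kinematic bicycle model, where the steering angle that tracks a curve of curvature κ is atan(L·κ) for wheelbase L. The intent labels, wheelbase, throttle value, and steering limit are hypothetical, and positive curvature is taken to mean a left turn.

```python
import math

WHEELBASE_M = 2.8      # hypothetical vehicle wheelbase
MAX_STEER_RAD = 0.5    # hypothetical actuator limit


def steering_from_curvature(curvature_per_m: float,
                            wheelbase_m: float = WHEELBASE_M) -> float:
    """Kinematic bicycle model: steering angle (rad) that follows the curve."""
    return math.atan(wheelbase_m * curvature_per_m)


def control_from_intent(intent: str, lane_curvature_per_m: float) -> dict:
    """Map a coarse intent to controls; the exact numbers come from geometry."""
    if intent == "stop":
        return {"steering": 0.0, "throttle": 0.0, "brake": True}
    if intent == "follow_lane":
        steer = steering_from_curvature(lane_curvature_per_m)
        steer = max(-MAX_STEER_RAD, min(MAX_STEER_RAD, steer))
        return {"steering": steer, "throttle": 0.3, "brake": False}
    raise ValueError(f"unknown intent: {intent!r}")


# The model only has to say "follow_lane"; it never produces a steering value.
gentle_left = control_from_intent("follow_lane", lane_curvature_per_m=0.02)
straight = control_from_intent("follow_lane", lane_curvature_per_m=0.0)
halt = control_from_intent("stop", lane_curvature_per_m=0.02)
```

The design choice is that the model is asked only for what it does well, namely recognizing that the road curves left, while the magnitude is derived from lane geometry.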
We paired a kinematic diffusion model (6.7M parameters, 128 trajectory proposals per scene) with evolutionary search scoring. On the same 1,000 NAVSIM scenes, this planning-first approach reaches an EPDMS of 80.2, but adding the language-model commander drops it to 70.4.
Gemini 2.5 Flash, faced with any camera showing vehicles or pedestrians, defaults to DECELERATE. With the exponential weighting (ωf = 5.0), a trajectory that disagrees with the commander gets penalized by 97% — regardless of its actual driving quality.
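The exact weighting formula is not stated here, so the following sketch assumes one plausible form: weight = exp(−ωf · d), where d in [0, 1] measures how strongly a trajectory disagrees with the commander. Under that assumed form, a 97% penalty corresponds to a disagreement of roughly 0.7. The proposals and their quality values are invented for illustration.

```python
import math

OMEGA_F = 5.0  # weighting strength quoted in the text


def agreement_weight(disagreement: float, omega_f: float = OMEGA_F) -> float:
    """ASSUMED form: exp(-omega_f * d), with disagreement d in [0, 1]."""
    if not 0.0 <= disagreement <= 1.0:
        raise ValueError("disagreement must lie in [0, 1]")
    return math.exp(-omega_f * disagreement)


def guided_score(quality: float, disagreement: float) -> float:
    """Driving quality scaled by agreement with the commander."""
    return quality * agreement_weight(disagreement)


# Hypothetical proposals: (name, driving quality, disagreement with commander)
proposals = [
    ("keep_speed", 0.95, 0.7),  # good driving, but contradicts DECELERATE
    ("slow_down", 0.60, 0.0),   # mediocre driving, matches the commander
]

winner = max(proposals, key=lambda p: guided_score(p[1], p[2]))[0]
penalty = 1.0 - agreement_weight(0.7)
```

With a penalty this steep, the better trajectory loses to the mediocre one purely because it contradicts the commander, which is the failure mode described above.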
| Method | EPDMS | Δ vs standalone planner |
|---|---|---|
| Always stop (lower bound) | 62.2 | −18.0 |
| Random single trajectory | 67.2 | −13.0 |
| Commander + planner (GEPA-optimized) | 70.4 | −9.8 |
| Standalone planner + heading fix | 80.2 | — |
The self-reinforcing failure: GEPA optimizes against ωf = 5.0 scoring — so it learns that DECELERATE is "safe" (never causes collisions) and reinforces the bias rather than correcting it.
In NAVSIM's open-loop setting, the scoring function at search time equals the final evaluation — other agents follow recorded trajectories regardless of ego. Any commander signal can only pull selection away from the optimum. Even oracle commanders with 100% accuracy cannot beat the standalone planner.
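The argument reduces to a property of argmax: if selection already maximizes the true evaluation score, reweighting the candidates can match that choice but never improve on it. The sketch below checks this on random data. The weights stand in for an arbitrary commander signal, and everything except the 128 proposals per scene is an illustrative assumption.

```python
import random
from typing import List, Optional


def select(scores: List[float], weights: Optional[List[float]] = None) -> int:
    """Index of the proposal with the highest (optionally reweighted) score."""
    if weights is None:
        weights = [1.0] * len(scores)
    return max(range(len(scores)), key=lambda i: scores[i] * weights[i])


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 1000
worse = 0
for _ in range(trials):
    scores = [rng.random() for _ in range(128)]   # true evaluation scores
    weights = [rng.random() for _ in range(128)]  # arbitrary commander signal
    unguided = scores[select(scores)]             # planner alone picks the max
    guided = scores[select(scores, weights)]      # commander reweights first
    assert guided <= unguided                     # guidance can never help here
    if guided < unguided:
        worse += 1
```

The inner assertion holds by construction, since `unguided` is the maximum of `scores`. The counter `worse` records how often the reweighting actually cost something.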
| Method | EPDMS | Δ vs standalone planner |
|---|---|---|
| Language-model commander (optimized prompts) | 70.4 | −9.8 |
| Oracle lateral commander (GT direction) | 77.1 | −1.6 |
| Oracle speed commander (GT action) | 78.4 | −2.3 |
| Standalone planner (heading fix, 1000 scenes) | 80.2 | — |
In closed-loop nuPlan (8s horizon, reactive agents), the scorer mispredicts — it gives 0.95 to trajectories that actually score 0.0 when agents react. A commander fills this gap. In NAVSIM, there is no gap to fill.
Three experiments converge on the same architecture principle.
The question isn't whether language models can drive — it's where the scorer fails. In open-loop benchmarks the planner's local score is already truth; any guidance is noise. In closed-loop and real traffic, agents react unpredictably, creating the scorer gap that makes a commander essential. The architecture that wins deploys language-model guidance only where the planner's local objective diverges from reality.