PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

Maciej K. Wozniak, Lianhang Liu, Yixi Cai, Patric Jensfelt

An efficient, camera-only, end-to-end autonomous driving model that achieves state-of-the-art performance without LiDAR or explicit BEV representations.

Code will be available upon publication.

Teaser scenarios: sharp right turn, sharp left turn, safe intersection turn.

While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors, and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment.

High Efficiency

Introduced PRIX, a novel camera-only, end-to-end planner that is significantly more efficient than multimodal and previous camera-only approaches in terms of inference speed and model size.

CaRT Module

Proposed the Context-aware Recalibration Transformer (CaRT), a new module designed to effectively enhance multi-level visual features for more robust planning.

Comprehensive Validation

Provided a comprehensive ablation study that validates our architectural choices and offers insights into optimizing the trade-off between performance, speed, and model size.

State-of-the-Art Performance

Achieved SOTA performance on NavSim and nuScenes datasets, outperforming larger, multimodal planners while being much smaller and faster.

PRIX's architecture processes multi-camera images through a visual backbone featuring our novel CaRT module. These enhanced visual features, combined with the vehicle's state and noisy anchors, are fed into a conditional diffusion planner to generate the final, safe trajectory.
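The dataflow above can be sketched at a shape level. This is a minimal numpy toy, not the paper's implementation: the backbone, CaRT, and diffusion planner are stand-ins (a pooling projection, a single self-attention pass, and a few fixed-point refinement steps), and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
proj = rng.standard_normal((3, 512))  # hypothetical channel projection

def visual_backbone(images):
    """Stand-in for the ResNet backbone: pool each camera image to a vector."""
    # images: (num_cams, H, W, 3) -> (num_cams, 512)
    return images.mean(axis=(1, 2)) @ proj

def cart_recalibrate(feats):
    """Toy stand-in for CaRT: one self-attention pass over camera tokens."""
    scores = feats @ feats.T / np.sqrt(feats.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return feats + attn @ feats                   # residual recalibration

def diffusion_planner(context, ego_state, anchors, steps=5):
    """Toy denoiser: pull noisy anchors toward a context-conditioned target."""
    target = np.full(anchors.shape, context.mean() + ego_state.mean())
    traj = anchors.copy()
    for _ in range(steps):
        traj += 0.5 * (target - traj)             # one refinement step
    return traj

images = rng.standard_normal((3, 8, 8, 3))   # 3 cameras, tiny frames
ego = rng.standard_normal(4)                 # e.g. velocity / acceleration
anchors = rng.standard_normal((20, 8, 3))    # 20 noisy anchors, 8 poses of (x, y, yaw)

plan = diffusion_planner(cart_recalibrate(visual_backbone(images)), ego, anchors)
print(plan.shape)  # (20, 8, 3)
```

The point is the interface, not the math: camera pixels become per-camera tokens, CaRT mixes them with attention, and the planner conditions trajectory denoising on the fused context plus ego state.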

PRIX Architecture Diagram

Performance vs. Speed

PRIX outperforms or matches multimodal methods like DiffusionDrive while being significantly smaller and faster, operating at a highly competitive framerate.

Model Performance vs. Speed Chart

Key Benchmarks

NavSim-v1

87.8 PDMS

Top-performing model, surpassing both camera-only and multimodal competitors.

NavSim-v2

84.2 EPDMS

Achieved the best overall score, solidifying its position as the leading model.

nuScenes

0.57m Avg. L2 Error

Outperforms all existing camera-based baselines with the lowest error and collision rate.

Our model correctly handles complex scenarios like busy intersections and can even generate safer trajectories than the ground truth by maintaining a larger safety distance.

Qualitative Result 1: sharp left turn.

Qualitative Result 2: improved safety margin. The model generates a trajectory that is safer than the ground truth by keeping a larger distance from other vehicles.

Robustness to Weather Conditions

PRIX demonstrates strong robustness, generating consistent and safe trajectories across clear, rainy, and snowy conditions.

Weather Example 1: left (l0), front (f0), and right (r0) camera views under clean, rainy, and snowy conditions.

Corresponding model predictions (Example 1) for the clean, rain, and snow inputs.

Weather Example 2: left (l0), front (f0), and right (r0) camera views under clean, rainy, and snowy conditions.

Corresponding model predictions (Example 2) for the clean, rain, and snow inputs.

Trajectory Refinement via Diffusion

The model generates initial trajectories and refines them over diffusion steps. The final selected path is shown in red, with the second-best alternative in blue.
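Selecting a final path from refined candidates can be illustrated with a small sketch. The scoring rule here (penalizing jerky motion via second differences) is purely illustrative, not the paper's actual selection criterion; `smoothness_score` and the candidate shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def smoothness_score(traj):
    # Hypothetical scorer: penalize large second differences (jerky motion).
    accel = np.diff(traj, n=2, axis=0)
    return -np.abs(accel).sum()

# 20 refined candidate trajectories, each 8 waypoints of (x, y)
candidates = rng.standard_normal((20, 8, 2)).cumsum(axis=1)

scores = np.array([smoothness_score(t) for t in candidates])
order = np.argsort(scores)[::-1]
best, runner_up = candidates[order[0]], candidates[order[1]]  # "red" and "blue" paths
```

After the diffusion steps produce a candidate set, any such scalar score induces the ranking shown in the figures: the top candidate is drawn in red, the runner-up in blue.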

Trajectory predictions across four scenarios: going straight on a narrow road, lane change, right turn, and left turn.

Hyper Parameters

The table below summarizes the full hyperparameter configuration used for the PRIX model. We group settings for the backbone, detection and planning heads, batching and precision, optimization, distributed training, and loss weights.

Backbone configuration
  • Image backbone: ResNet34
  • Shared CaRT dimension: 512
  • Number of CaRT self-attention layers: 2
  • Number of attention heads: 4

Heads configuration (detection and planning)
  • Max number of bounding boxes: 30
  • Segmentation feature channels: 64
  • Segmentation number of classes: 7
  • Trajectory output: (x, y, yaw)

Batching and precision
  • GPUs: 8 × A100 (40 GB)
  • Per-GPU batch size: 64
  • Mixed precision: bfloat16 (AMP)
  • Gradient clipping: 0.1

Optimization
  • Optimizer: AdamW
  • Initial learning rate: 1e-5
  • Weight decay: 1e-3
  • AdamW (β₁, β₂): (0.9, 0.999)
  • LR scheduler: MultiStepLR
  • LR decay factor: 0.1
  • Param-wise LR multiplier (image encoder): 0.5

Loss weights
  • Trajectory loss weight: 10.0
  • Agent classification weight: 10.0
  • Agent box regression weight: 1.0
  • Semantic segmentation weight: 10.0
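The table can be captured as a plain config structure. The dictionary keys and the `lr_for` helper below are illustrative naming choices, not the project's actual code; only the values come from the table.

```python
# Hypothetical config mirroring the hyperparameter table (key names are illustrative).
config = {
    "backbone": {"image_backbone": "ResNet34", "cart_dim": 512,
                 "cart_layers": 2, "attn_heads": 4},
    "heads": {"max_boxes": 30, "seg_channels": 64, "seg_classes": 7,
              "traj_output": ("x", "y", "yaw")},
    "optim": {"optimizer": "AdamW", "lr": 1e-5, "weight_decay": 1e-3,
              "betas": (0.9, 0.999), "scheduler": "MultiStepLR",
              "lr_decay": 0.1, "encoder_lr_mult": 0.5},
    "loss_weights": {"trajectory": 10.0, "agent_cls": 10.0,
                     "agent_box": 1.0, "semantic_seg": 10.0},
}

def lr_for(param_group, cfg=config["optim"]):
    # Param-wise LR: the image encoder runs at half the base learning rate.
    mult = cfg["encoder_lr_mult"] if param_group == "image_encoder" else 1.0
    return cfg["lr"] * mult

print(lr_for("image_encoder"))  # 5e-06
```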

Additional Experiments

All experiments in this section are performed on NavSim-v1 unless noted otherwise.


Backbone Capacity, Speed, and Stability

  • Aim: To identify the backbone (ResNet34, ResNet50, ResNet101) that offers the best balance of performance, speed, and stability.
  • Method: PRIX was trained with each backbone, and PDMS (mean ± standard deviation over five runs), parameter count, and FPS were measured.
  • Result: ResNet34 provides the best overall trade-off. It is the fastest model (57.0 FPS) and the most stable (87.8 ± 0.1 PDMS), while showing only a minor performance difference compared to the larger and slower ResNet101.
Backbone comparison (mean ± std over 5 runs)

Model          | Backbone  | PDMS       | Params | FPS
PRIX (default) | ResNet34  | 87.8 ± 0.1 | 37M    | 57.0
PRIX-50        | ResNet50  | 87.8 ± 0.2 | 41M    | 47.3
PRIX-101       | ResNet101 | 87.9 ± 0.4 | 58M    | 28.6

Ablation on Loss Weights

  • Aim: To determine the best weighting for the detection loss and the semantic loss.
  • Method: A grid search was performed over different weight combinations, and the PDMS score was recorded for each setting.
  • Result: A low detection loss weight and a high semantic loss weight perform best. The configuration with detection weight 1 and semantic weight 10 yields 87.8 PDMS. Performance increases consistently as the semantic loss weight increases.
Figure: PDMS score heatmap comparing different combinations of detection and semantic loss weights.
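The grid search described above has this shape. The weight grids and the `evaluate` stub are hypothetical; in the actual study, `evaluate` stands for "train PRIX with these weights and measure PDMS", while the toy surrogate here merely reproduces the reported trend (best at detection weight 1, semantic weight 10).

```python
from itertools import product

det_weights = (1.0, 2.0, 5.0)     # hypothetical grid values
sem_weights = (1.0, 5.0, 10.0)

def evaluate(det_w, sem_w):
    # Stub for "train with these weights, report PDMS". This toy surrogate
    # peaks at det_w = 1 and rises with sem_w, mirroring the reported trend.
    return 87.0 - 0.2 * abs(det_w - 1.0) + 0.08 * sem_w

best = max(product(det_weights, sem_weights), key=lambda ws: evaluate(*ws))
print(best)  # (1.0, 10.0)
```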

Sensor Failures

  • Aim: To test PRIX's robustness to sensor failures, such as camera noise or dropout.
  • Method: Sensor failures were simulated at test time for a standard model. New models were then trained with these corruptions (noise or dropout) to evaluate whether robustness improves.
  • Result: While failures reduce the baseline model's score from 88.7 to 82.2 PDMS, training with noise recovers the score to 84.7.
Robustness to sensor failures (PDMS↑)

Training method              | Test-time input | PDMS
Standard (baseline)          | Clean           | 88.7
Standard (baseline)          | With failures   | 82.2
Train w/ full camera dropout | With failures   | 83.9
Train w/ random noise        | With failures   | 84.7
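The two training-time corruption regimes can be sketched as a single augmentation function. The function name, probabilities, and noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_cameras(images, p_dropout=0.1, noise_std=0.1):
    """Hypothetical training-time corruption: per camera, either drop the
    whole view (simulated sensor failure) or add Gaussian pixel noise."""
    out = images.copy()
    for cam in range(out.shape[0]):
        if rng.random() < p_dropout:
            out[cam] = 0.0                                     # full camera dropout
        else:
            out[cam] += rng.normal(0.0, noise_std, out[cam].shape)
    return out

batch = rng.standard_normal((3, 8, 8, 3))   # 3 cameras, tiny frames
aug = corrupt_cameras(batch)
```

Exposing the model to such corruptions during training is what recovers the score from 82.2 back to 84.7 PDMS under test-time failures.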

Impact of Ego Status in the Planning Head

  • Aim: To understand how important the ego status input (velocity, acceleration) is for the planner.
  • Method: The ego status was corrupted at test time (masked, replaced with random noise, or scaled) for both a standard model and a model trained with status corruption.
  • Result: Ego status is critical, but PRIX can be trained to be robust. Corrupting the status drops the baseline model's score from 87.8 to 66.8 PDMS. However, the model trained with corruption achieves a strong 84.7 PDMS, even with random inputs.
PDMS↑ under test-time ego status modification.

Ego status (test time) | Corruption at training | PRIX | DiffusionDrive
Status (clean)         | ×                      | 87.8 | 88.1
Zero (masked)          | ×                      | 64.4 | 63.9
Random                 | ×                      | 66.8 | 68.1
Scaled                 | ×                      | 80.9 | 81.3
Zero (masked)          | ✓                      | 81.3 | 81.1
Random                 | ✓                      | 84.7 | 83.9
Scaled                 | ✓                      | 84.4 | 84.0
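The four test-time conditions in the table amount to simple transforms of the ego status vector. This sketch is illustrative code, not the evaluation harness; the scaling range is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_ego(status, mode):
    """Illustrative versions of the table's test-time ego status conditions."""
    if mode == "clean":
        return status
    if mode == "zero":                       # fully masked status
        return np.zeros_like(status)
    if mode == "random":                     # uninformative random values
        return rng.standard_normal(status.shape)
    if mode == "scaled":                     # plausibly mis-calibrated sensor
        return status * rng.uniform(0.5, 1.5)
    raise ValueError(f"unknown mode: {mode}")

ego = np.array([5.0, 0.2, 1.1, -0.3])       # e.g. velocity / acceleration terms
masked = corrupt_ego(ego, "zero")
```

Applying these corruptions during training, rather than only at test time, is what closes most of the gap (66.8 → 84.7 PDMS in the random case).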