CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation

BMVC 2025 Submission #479

Authors: [Your Name(s) Here]

Abstract

Vision foundation models have revolutionized 2D camera-based perception by extracting generalized features for downstream tasks. Recent work applies self-supervised cross-modal knowledge distillation (KD) to transfer these capabilities to 3D LiDAR models, but often relies on complex losses or pseudo-semantic maps. We introduce CleverDistiller, a self-supervised cross-modal 2D-to-3D KD framework built on simple yet effective design choices: a direct feature similarity loss combined with an MLP projection head, which captures complex semantic dependencies without pseudo-semantic maps or explicit semantic supervision, and a self-supervised occupancy prediction task that improves the 3D spatial reasoning of the learned representations. Experiments on autonomous driving benchmarks show that CleverDistiller achieves state-of-the-art performance in both 3D semantic segmentation and 3D object detection, with gains of up to 10% mIoU, particularly when fine-tuning with limited data.

Core Contributions 💡

Our work introduces a simple yet powerful framework for distilling knowledge from 2D Vision Foundation Models (VFMs) to 3D LiDAR networks.

Simplified Distillation

We introduce a **non-linear MLP projection head** that allows the model to learn more informative features without complex losses or semantic priors, addressing a key limitation overlooked by previous methods.
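
For concreteness, here is a minimal PyTorch sketch of what such a projection head could look like. The layer widths, depth, and normalization are illustrative assumptions, not the exact configuration from the paper.

```python
import torch.nn as nn

class MLPProjectionHead(nn.Module):
    """Maps per-point 3D backbone features into the 2D teacher's embedding space.

    A non-linear MLP (rather than a single linear layer) gives the student enough
    capacity to match the VFM features using only a plain similarity loss.
    All dimensions below are illustrative assumptions.
    """

    def __init__(self, in_dim: int = 256, hidden_dim: int = 512, out_dim: int = 768, num_layers: int = 3):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(num_layers - 1):
            layers += [nn.Linear(dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True)]
            dim = hidden_dim
        layers.append(nn.Linear(dim, out_dim))  # final projection into the teacher's feature dimension
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):  # x: (num_points, in_dim) point features from the 3D backbone
        return self.mlp(x)
```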

Enhanced Spatial Reasoning

A **self-supervised occupancy prediction task** complements the distilled semantic knowledge, encouraging the LiDAR model to learn the spatial and geometric structure needed for more robust 3D representations.
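
As a rough sketch of how such an auxiliary task can be set up, the occupancy targets can be derived directly from the raw point cloud, so no manual labels are required. The grid range, voxel size, and bird's-eye-view simplification below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def voxel_occupancy_targets(points, grid_min=-50.0, voxel_size=0.5, grid_dim=200):
    """Build a binary occupancy grid directly from the LiDAR points (self-supervised target).

    points: (N, 3) tensor of x, y, z coordinates in meters.
    Returns a (grid_dim, grid_dim) bird's-eye-view target for brevity; a full 3D grid
    works the same way with one extra index.
    """
    idx = ((points[:, :2] - grid_min) / voxel_size).long().clamp(0, grid_dim - 1)
    occ = torch.zeros(grid_dim, grid_dim)
    occ[idx[:, 0], idx[:, 1]] = 1.0
    return occ

def occupancy_loss(pred_logits, points):
    """Binary cross-entropy between predicted occupancy logits and point-derived targets."""
    return F.binary_cross_entropy_with_logits(pred_logits, voxel_occupancy_targets(points))
```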

SOTA Performance

Our method sets a new state-of-the-art, especially in **low-data regimes**, and demonstrates superior generalization and robustness across various datasets and under data corruption.

Our Method

CleverDistiller distills knowledge from a 2D VFM teacher to a 3D LiDAR student using a simple feature similarity loss, an MLP projection head, and an auxiliary spatial task.
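
A minimal sketch of the core distillation objective, assuming the student's projected point features have already been paired with the teacher's 2D features at the pixels each point projects into (camera projection and feature sampling omitted for brevity):

```python
import torch.nn.functional as F

def feature_similarity_loss(student_feats, teacher_feats):
    """Direct feature similarity loss between paired point and pixel features.

    student_feats: (N, D) point features after the MLP projection head.
    teacher_feats: (N, D) frozen 2D VFM features gathered at the corresponding pixels.
    Maximizing cosine similarity is equivalent to minimizing 1 - cos(student, teacher).
    """
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()
```

In a multi-task setup like this, the auxiliary occupancy loss sketched earlier would typically be added to this term with a weighting factor during pre-training.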

CleverDistiller Pipeline Diagram from the paper

Key Results & Findings 📊

Our method achieves significant improvements in performance, robustness, and generalization across multiple benchmarks.

1. The Power of the MLP Projection Head

Our ablation studies show that using an MLP projection head instead of a linear one is critical. The MLP yields more informative 3D backbone features, as measured by the **RankMe metric**, and this increase in feature informativeness correlates directly with better downstream performance.

A 3-layer MLP provides a roughly 4% mIoU boost on nuScenes (fine-tuning on 1% of the data) compared to a simple linear projection.
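
For readers unfamiliar with it, RankMe estimates the effective rank of an (N, D) feature matrix as the exponential of the Shannon entropy of its normalized singular value spectrum; higher values indicate that the features occupy more dimensions. A minimal sketch:

```python
import torch

def rankme(features: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """RankMe: soft effective rank of an (N, D) feature matrix (Garrido et al., 2023)."""
    sigma = torch.linalg.svdvals(features)        # singular values of the feature matrix
    p = sigma / sigma.sum() + eps                 # normalize the spectrum into a distribution
    return torch.exp(-(p * torch.log(p)).sum())   # exp of its entropy = effective rank
```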

Graph showing mIoU vs RankMe metric

2. State-of-the-Art Semantic Segmentation

CleverDistiller consistently outperforms previous methods. When fine-tuned on only **1% of nuScenes data**, it achieves **59.8% mIoU**, a significant improvement over the ScaLR baseline's 55.8% and other methods such as SuperFlow's 48.1%.

The benefits hold when using larger teacher models and advanced 3D backbones, demonstrating the scalability of our approach.

Table of semantic segmentation results

3. Superior Robustness & Generalization

On the nuScenes-C benchmark, which introduces corruptions like fog and rain, CleverDistiller achieves the **lowest mean Corruption Error (mCE)** and **highest mean Resilience Rate (mRR)**.

The improvement is most pronounced in the cross-sensor scenario (extremely sparse LiDAR), where our model outperforms competing methods by up to 20%.
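
As a reference for these metrics, mCE and mRR are typically computed as in LiDAR robustness benchmarks such as Robo3D: per corruption type, the Corruption Error compares a model's error to a baseline model across severity levels, and the Resilience Rate measures how much of the clean-data score survives the corruption; both are then averaged over all corruption types. The sketch below follows that convention and may differ in detail from the paper's exact protocol.

```python
def corruption_error(miou_corrupt, miou_corrupt_baseline):
    """CE for one corruption type: error relative to a baseline, summed over severity levels.

    mIoU values are given as fractions in [0, 1].
    """
    return sum(1 - m for m in miou_corrupt) / sum(1 - m for m in miou_corrupt_baseline)

def resilience_rate(miou_corrupt, miou_clean):
    """RR for one corruption type: fraction of the clean-data mIoU retained under corruption."""
    return sum(miou_corrupt) / (len(miou_corrupt) * miou_clean)

# mCE and mRR are the means of CE and RR over all corruption types (fog, rain, snow, ...).
```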

Table of robustness results

4. Qualitative Improvements

Visually, our method produces cleaner and more spatially consistent semantic maps with fewer errors along object boundaries, which is critical for real-world applications.

Qualitative comparison between ScaLR and CleverDistiller

Conclusion

CleverDistiller demonstrates that complexity is not always the answer. Through two simple yet effective design choices—an **MLP projection head** and an **auxiliary occupancy task**—we significantly advance the state of the art in cross-modal 2D-to-3D knowledge distillation. Our method is not only simpler and more efficient but also produces more robust, generalizable, and powerful 3D representations.

Read the Full Paper (PDF)