CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation
BMVC 2025 Submission #479
Authors: [Your Name(s) Here]
Abstract
Vision foundation models have revolutionized 2D camera-based perception by extracting generalized features for downstream tasks. Recent work applies self-supervised cross-modal knowledge distillation (KD) to transfer these capabilities to 3D LiDAR models, but often relies on complex losses or pseudo-semantic maps. We introduce CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework built on simple yet effective design choices. Our method uses a direct feature similarity loss and an MLP projection head to capture complex semantic dependencies without relying on pseudo-semantic maps or explicit semantic supervision. In addition, we improve the 3D spatial reasoning capabilities of the learned representations through a self-supervised occupancy prediction task. Experiments on autonomous driving benchmarks show that CleverDistiller achieves state-of-the-art performance in both 3D semantic segmentation and 3D object detection, with up to 10% mIoU improvement, particularly when fine-tuning with limited data.
Core Contributions 💡
Our work introduces a simple yet powerful framework for distilling knowledge from 2D Vision Foundation Models (VFMs) to 3D LiDAR networks.
Simplified Distillation
We introduce a **non-linear MLP projection head** that lets the model learn more informative features without complex losses or semantic priors, a simple design choice overlooked by previous methods.
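For intuition, a minimal PyTorch sketch of such a head is shown below. The three-layer structure mirrors the ablation discussed later, while the hidden width and feature dimensions are placeholders rather than the paper's exact configuration.

```python
import torch.nn as nn

def make_projection_head(in_dim: int, teacher_dim: int, hidden_dim: int = 2048) -> nn.Module:
    """Non-linear MLP mapping per-point 3D backbone features into the
    2D teacher's feature space. All dimensions here are illustrative."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, teacher_dim),
    )
```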
Enhanced Spatial Reasoning
A **self-supervised occupancy prediction task** complements the semantic knowledge, encouraging the LiDAR model to learn crucial spatial and geometric information for more robust 3D representations.
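One way such an auxiliary task could be wired up is sketched below, assuming a dense voxel feature grid and binary occupancy targets derived from the raw LiDAR points; the head, grid layout, and loss are illustrative simplifications, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OccupancyHead(nn.Module):
    """Predicts per-voxel occupancy logits from 3D backbone features (illustrative)."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.head = nn.Conv3d(in_channels, 1, kernel_size=1)

    def forward(self, voxel_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (B, C, X, Y, Z) dense voxel features from the student backbone
        return self.head(voxel_feats).squeeze(1)  # (B, X, Y, Z) occupancy logits

def occupancy_loss(logits: torch.Tensor, occupied: torch.Tensor) -> torch.Tensor:
    # occupied: binary grid marking voxels that contain at least one LiDAR point
    return F.binary_cross_entropy_with_logits(logits, occupied.float())
```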
SOTA Performance
Our method sets a new state-of-the-art, especially in **low-data regimes**, and demonstrates superior generalization and robustness across various datasets and under data corruption.
Our Method
CleverDistiller distills knowledge from a 2D VFM teacher to a 3D LiDAR student using a simple feature similarity loss, an MLP projection head, and an auxiliary spatial task.
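As a sketch of how these pieces can combine, the snippet below computes a cosine feature similarity term between projected student features and frozen teacher features at matched point-to-pixel locations, plus the auxiliary occupancy term. The loss weighting and the matching details are assumptions for illustration, not the paper's exact recipe.

```python
import torch.nn.functional as F

def kd_loss(student_feats, teacher_feats, proj_head,
            occ_logits=None, occ_targets=None, lambda_occ=1.0):
    """Self-supervised cross-modal KD objective (illustrative sketch).

    student_feats: (N, C_s) per-point features from the 3D LiDAR backbone
    teacher_feats: (N, C_t) frozen 2D VFM features sampled at the pixel each
                   LiDAR point projects to (point-to-pixel matching assumed)
    """
    z = proj_head(student_feats)                         # project into teacher space, (N, C_t)
    sim = F.cosine_similarity(z, teacher_feats, dim=-1)  # per-point similarity, (N,)
    loss = (1.0 - sim).mean()                            # direct feature similarity loss
    if occ_logits is not None and occ_targets is not None:
        loss = loss + lambda_occ * F.binary_cross_entropy_with_logits(
            occ_logits, occ_targets.float())             # auxiliary occupancy term
    return loss
```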
Key Results & Findings 📊
Our method achieves significant improvements in performance, robustness, and generalization across multiple benchmarks.
1. The Power of the MLP Projection Head
Our ablation studies show that using an MLP projection head instead of a linear one is critical. The MLP yields more informative 3D backbone features, as measured by the **RankMe metric**, and higher RankMe scores correlate directly with better downstream performance.
A 3-layer MLP provided a ~4% mIoU boost on nuScenes (1% fine-tuning) compared to a simple linear projection.
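For reference, RankMe scores a feature matrix by the entropy of its normalized singular-value spectrum, exponentiated into a smooth effective-rank estimate. A minimal implementation could look like the following (the epsilon value is chosen arbitrarily):

```python
import torch

def rankme(features: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Smooth effective rank of a feature matrix (N samples x D dims):
    entropy of the normalized singular-value distribution, exponentiated."""
    s = torch.linalg.svdvals(features)      # singular values
    p = s / (s.sum() + eps) + eps           # normalized spectrum
    return torch.exp(-(p * p.log()).sum())  # higher = more informative features
```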
2. State-of-the-Art Semantic Segmentation
CleverDistiller consistently outperforms previous methods. When fine-tuned on only **1% of nuScenes data**, our method achieves **59.8% mIoU**, a significant leap from the baseline ScaLR's 55.8% and other methods like SuperFlow's 48.1%.
The benefits hold when using larger teacher models and advanced 3D backbones, demonstrating the scalability of our approach.
3. Superior Robustness & Generalization
On the nuScenes-C benchmark, which introduces corruptions like fog and rain, CleverDistiller achieves the **lowest mean Corruption Error (mCE)** and **highest mean Resilience Rate (mRR)**.
The improvement is most dramatic in the cross-sensor scenario (extremely sparse LiDAR), where our model outperforms competing methods by up to 20%.
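For context, the snippet below shows commonly used definitions of Corruption Error (error relative to a reference baseline under each corruption) and Resilience Rate (corrupted versus clean performance); this is an assumed convention for illustration, not extracted from the paper.

```python
def corruption_metrics(miou_corrupt, miou_baseline_corrupt, miou_clean):
    """Mean Corruption Error (mCE, lower is better) and mean Resilience
    Rate (mRR, higher is better) under an assumed benchmark convention.

    miou_corrupt:          evaluated model's mIoU, one value per corruption type
    miou_baseline_corrupt: reference baseline's mIoU on the same corruptions
    miou_clean:            evaluated model's mIoU on clean data
    """
    ce = [(1.0 - m) / (1.0 - b) for m, b in zip(miou_corrupt, miou_baseline_corrupt)]
    rr = [m / miou_clean for m in miou_corrupt]
    return sum(ce) / len(ce), sum(rr) / len(rr)
```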
4. Qualitative Improvements
Visually, our method produces cleaner and more spatially consistent semantic maps with fewer errors along object boundaries, which is critical for real-world applications.
Conclusion
CleverDistiller demonstrates that complexity is not always the answer. Through two simple yet effective design choices—an **MLP projection head** and an **auxiliary occupancy task**—we significantly advance the state of the art in cross-modal 2D-to-3D knowledge distillation. Our method is not only simpler and more efficient but also produces more robust, generalizable, and powerful 3D representations.
Read the Full Paper (PDF)