Sun Tianfang, Zhang Zhizhong, Tan Xin, Peng Yong, Qu Yanyun, Xie Yuan
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):11059-11072. doi: 10.1109/TPAMI.2024.3451658. Epub 2024 Nov 6.
Combining LiDAR points and images for robust semantic segmentation has shown great potential. However, the heterogeneity between the two modalities (e.g., density and field of view) makes it difficult to establish a bijective mapping between points and pixels, and this modality alignment problem introduces new challenges in network design and data processing for cross-modal methods. Specifically, 1) points projected outside the image plane have no corresponding pixels, and 2) the complexity of maintaining geometric consistency limits the deployment of many data augmentation techniques. To address these challenges, we propose a cross-modal knowledge imputation and transition approach. First, we introduce a bidirectional feature fusion strategy that imputes missing image features while performing cross-modal fusion, which allows us to generate reliable predictions even when images are missing. Second, we propose a Uni-to-Multi modal Knowledge Distillation (U2MKD) framework that transfers informative features from a single-modality teacher to a cross-modality student; this sidesteps the augmentation misalignment issue and enables us to train the student effectively. Extensive experiments on the nuScenes, Waymo, and SemanticKITTI datasets demonstrate the effectiveness of our approach. Notably, our method achieves an 8.3 mIoU gain over the LiDAR-only baseline on the nuScenes validation set and achieves state-of-the-art performance on all three datasets.
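To make the two components described in the abstract concrete, below is a minimal PyTorch sketch of the general idea: a cross-modal student that imputes a pseudo image feature for every point falling outside the camera view, trained with supervision plus feature and soft-label distillation from a frozen LiDAR-only teacher. All module names (PointEncoder, CrossModalStudent, u2m_distill_loss), the linear imputation layer, and the loss weights are hypothetical illustrations of the mechanism, not the authors' released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PointEncoder(nn.Module):
        """Per-point feature extractor (a stand-in for a LiDAR backbone)."""
        def __init__(self, in_dim=4, feat_dim=64):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))

        def forward(self, points):            # points: (N, in_dim)
            return self.mlp(points)           # (N, feat_dim)

    class CrossModalStudent(nn.Module):
        """Student that fuses point and image features; for points projected
        outside the image plane it imputes a pseudo image feature from the
        point feature (a stand-in for the paper's bidirectional fusion)."""
        def __init__(self, feat_dim=64, num_classes=16):
            super().__init__()
            self.point_enc = PointEncoder(feat_dim=feat_dim)
            self.imputer = nn.Linear(feat_dim, feat_dim)   # point -> pseudo image feature
            self.fuse = nn.Linear(2 * feat_dim, feat_dim)
            self.head = nn.Linear(feat_dim, num_classes)

        def forward(self, points, img_feats, in_view):
            # img_feats: (N, feat_dim) sampled at projected pixels (zeros where invalid)
            # in_view:   (N,) bool mask, True where the point projects inside the image
            p = self.point_enc(points)
            imputed = self.imputer(p)                      # always available
            img = torch.where(in_view.unsqueeze(-1), img_feats, imputed)
            fused = self.fuse(torch.cat([p, img], dim=-1))
            return fused, self.head(fused)

    def u2m_distill_loss(student_feats, student_logits,
                         teacher_feats, teacher_logits, labels):
        """Uni-to-multimodal distillation: supervised cross-entropy on the
        student plus feature and soft-label transfer from a frozen
        LiDAR-only teacher (loss weights are placeholders)."""
        ce = F.cross_entropy(student_logits, labels)
        feat_kd = F.mse_loss(student_feats, teacher_feats.detach())
        logit_kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                            F.softmax(teacher_logits.detach(), dim=-1),
                            reduction="batchmean")
        return ce + 0.5 * feat_kd + 0.5 * logit_kd

    # Toy usage: 100 points with (x, y, z, intensity), 16 classes.
    points = torch.randn(100, 4)
    img_feats = torch.randn(100, 64)
    in_view = torch.rand(100) > 0.3       # ~30% of points fall outside the image
    teacher_feats, teacher_logits = torch.randn(100, 64), torch.randn(100, 16)
    student = CrossModalStudent()
    feats, logits = student(points, img_feats, in_view)
    loss = u2m_distill_loss(feats, logits, teacher_feats, teacher_logits,
                            torch.randint(0, 16, (100,)))

The imputation branch is what would let such a student predict when images are absent: the torch.where simply swaps in the pseudo feature wherever no valid pixel exists, mirroring the abstract's claim of reliable predictions even when images are missing.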