Tan Sinan, Sima Kuankuan, Wang Dunzheng, Ge Mengmeng, Guo Di, Liu Huaping
IEEE Trans Neural Netw Learn Syst. 2025 Apr;36(4):6738-6751. doi: 10.1109/TNNLS.2024.3395633. Epub 2025 Apr 4.
In vision-and-language navigation (VLN) tasks, most current methods rely primarily on RGB images, overlooking the rich 3-D semantic information inherent to the environment. To address this, we introduce a novel VLN framework that integrates 3-D semantic information into the navigation process. Our approach features a self-supervised training scheme that incorporates voxel-level 3-D semantic reconstruction to create a detailed 3-D semantic representation. A key component of this framework is a pretext task based on region queries, which determines whether objects are present in specific 3-D regions. Building on this, we devise a long short-term memory (LSTM)-based navigation model trained on our 3-D semantic representations. To maximize the utility of these representations, we implement a cross-modal distillation strategy that encourages the RGB model's outputs to emulate those of the 3-D semantic feature network, enabling the concurrent training of both branches to effectively merge RGB and 3-D semantic data. Comprehensive evaluations on the R2R and R4R datasets show that our method significantly improves VLN performance.
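To make the cross-modal distillation idea concrete, the following is a minimal PyTorch sketch of one way such a distillation term could be wired up between an RGB branch and a 3-D semantic branch. The encoder modules, the MSE-based feature-matching loss, the feature concatenation, and the `distill_weight` hyperparameter are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalDistillation(nn.Module):
    """Hypothetical sketch: the RGB branch is encouraged to mimic the
    3-D semantic branch while both branches are trained jointly."""

    def __init__(self, rgb_encoder: nn.Module, sem3d_encoder: nn.Module,
                 distill_weight: float = 1.0):
        super().__init__()
        self.rgb_encoder = rgb_encoder        # maps RGB views -> feature vectors
        self.sem3d_encoder = sem3d_encoder    # maps voxel-level semantics -> feature vectors
        self.distill_weight = distill_weight  # assumed trade-off hyperparameter

    def forward(self, rgb_views: torch.Tensor, voxel_semantics: torch.Tensor):
        rgb_feat = self.rgb_encoder(rgb_views)
        sem_feat = self.sem3d_encoder(voxel_semantics)
        # Distillation term: pull RGB features toward the (detached) 3-D
        # semantic features so the RGB branch absorbs 3-D knowledge.
        distill_loss = F.mse_loss(rgb_feat, sem_feat.detach())
        # Fused representation that could be passed to an LSTM navigation policy.
        fused = torch.cat([rgb_feat, sem_feat], dim=-1)
        return fused, self.distill_weight * distill_loss
```

In this sketch the distillation loss would simply be added to the navigation objective during training, so gradients from both the policy loss and the feature-matching term update the RGB branch.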