Self-Supervised 3-D Semantic Representation Learning for Vision-and-Language Navigation.

Authors

Tan Sinan, Sima Kuankuan, Wang Dunzheng, Ge Mengmeng, Guo Di, Liu Huaping

Publication

IEEE Trans Neural Netw Learn Syst. 2025 Apr;36(4):6738-6751. doi: 10.1109/TNNLS.2024.3395633. Epub 2025 Apr 4.

DOI: 10.1109/TNNLS.2024.3395633
PMID: 38743539
Abstract

In vision-and-language navigation (VLN) tasks, most current methods primarily utilize RGB images, overlooking the rich 3-D semantic data inherent to environments. To rectify this, we introduce a novel VLN framework that integrates 3-D semantic information into the navigation process. Our approach features a self-supervised training scheme that incorporates voxel-level 3-D semantic reconstruction to create a detailed 3-D semantic representation. A key component of this framework is a pretext task focused on region queries, which determines the presence of objects in specific 3-D areas. Following this, we devise a long short-term memory (LSTM)-based navigation model that is trained using our 3-D semantic representations. To maximize the utility of these 3-D semantic representations, we implement a cross-modal distillation strategy. This strategy encourages the RGB model's outputs to emulate those from the 3-D semantic feature network, enabling the concurrent training of both branches to merge RGB and 3-D semantic data effectively. Comprehensive evaluations on both the R2R and R4R datasets reveal that our method significantly enhances performance in VLN tasks.
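The two mechanisms the abstract names can be made concrete with a short sketch. The snippet below is an illustrative reconstruction, not the authors' released code: the module names, feature dimension, action count, temperature, and the KL-based distillation loss are all assumptions, and the paper's actual heads and objectives may differ. It shows a region-query pretext head that predicts which objects are present in a queried 3-D region, and a cross-modal distillation term that pulls the RGB branch's action distribution toward the (detached) 3-D semantic branch.

```python
# Illustrative sketch only: assumed names, shapes, and loss forms,
# not the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionQueryHead(nn.Module):
    """Pretext head: given pooled features of a queried 3-D region,
    predict which object classes are present there (multi-label)."""

    def __init__(self, feat_dim: int = 512, num_classes: int = 40):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, region_feat: torch.Tensor, presence_labels: torch.Tensor):
        logits = self.classifier(region_feat)
        # Binary cross-entropy over per-class presence in the queried region.
        loss = F.binary_cross_entropy_with_logits(logits, presence_labels)
        return logits, loss


class CrossModalDistillation(nn.Module):
    """Distillation term: the RGB branch's action distribution is trained
    to mimic the 3-D semantic branch's, so both branches can be trained
    concurrently while fusing the two modalities."""

    def __init__(self, feat_dim: int = 512, num_actions: int = 6,
                 temperature: float = 2.0):
        super().__init__()
        self.temperature = temperature
        self.rgb_head = nn.Linear(feat_dim, num_actions)
        self.sem3d_head = nn.Linear(feat_dim, num_actions)

    def forward(self, rgb_feat: torch.Tensor, sem3d_feat: torch.Tensor):
        t = self.temperature
        rgb_logits = self.rgb_head(rgb_feat)
        sem3d_logits = self.sem3d_head(sem3d_feat)
        # Detach the 3-D semantic "teacher" so the KL term only updates
        # the RGB "student"; scaling by t*t keeps gradients stable.
        student = F.log_softmax(rgb_logits / t, dim=-1)
        teacher = F.softmax(sem3d_logits.detach() / t, dim=-1)
        distill = F.kl_div(student, teacher, reduction="batchmean") * (t * t)
        return rgb_logits, sem3d_logits, distill
```

In a training step, these terms would be summed with the navigation loss under weights of one's own choosing; the abstract does not report the actual weighting.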


Similar Articles

1. Self-Supervised 3-D Semantic Representation Learning for Vision-and-Language Navigation.
IEEE Trans Neural Netw Learn Syst. 2025 Apr;36(4):6738-6751. doi: 10.1109/TNNLS.2024.3395633. Epub 2025 Apr 4.
2. PanoGen++: Domain-adapted text-guided panoramic environment generation for vision-and-language navigation.
Neural Netw. 2025 Jul;187:107320. doi: 10.1016/j.neunet.2025.107320. Epub 2025 Mar 10.
3. Correctable Landmark Discovery via Large Models for Vision-Language Navigation.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8534-8548. doi: 10.1109/TPAMI.2024.3407759. Epub 2024 Nov 6.
4. DICCR: Double-gated intervention and confounder causal reasoning for vision-language navigation.
Neural Netw. 2025 Apr;184:107078. doi: 10.1016/j.neunet.2024.107078. Epub 2024 Dec 30.
5. Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12535-12549. doi: 10.1109/TPAMI.2023.3273594. Epub 2023 Sep 5.
6. HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation.
IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):8524-8537. doi: 10.1109/TPAMI.2023.3234243. Epub 2023 Jun 5.
7. RDCRNet: RGB-T Object Detection Network Based on Cross-Modal Representation Model.
Entropy (Basel). 2025 Apr 19;27(4):442. doi: 10.3390/e27040442.
8. NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning.
IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5945-5957. doi: 10.1109/TPAMI.2025.3554559.
9. MiLNet: Multiplex Interactive Learning Network for RGB-T Semantic Segmentation.
IEEE Trans Image Process. 2025;34:1686-1699. doi: 10.1109/TIP.2025.3544484. Epub 2025 Mar 11.
10. A Depth Awareness and Learnable Feature Fusion Network for Enhanced Geometric Perception in Semantic Correspondence.
Sensors (Basel). 2024 Oct 17;24(20):6680. doi: 10.3390/s24206680.