Self-Supervised 3-D Semantic Representation Learning for Vision-and-Language Navigation.

Authors

Tan Sinan, Sima Kuankuan, Wang Dunzheng, Ge Mengmeng, Guo Di, Liu Huaping

Publication

IEEE Trans Neural Netw Learn Syst. 2025 Apr;36(4):6738-6751. doi: 10.1109/TNNLS.2024.3395633. Epub 2025 Apr 4.

DOI: 10.1109/TNNLS.2024.3395633
PMID: 38743539
Abstract

In vision-and-language navigation (VLN) tasks, most current methods primarily utilize RGB images, overlooking the rich 3-D semantic data inherent to environments. To rectify this, we introduce a novel VLN framework that integrates 3-D semantic information into the navigation process. Our approach features a self-supervised training scheme that incorporates voxel-level 3-D semantic reconstruction to create a detailed 3-D semantic representation. A key component of this framework is a pretext task focused on region queries, which determines the presence of objects in specific 3-D areas. Following this, we devise a long short-term memory (LSTM)-based navigation model that is trained using our 3-D semantic representations. To maximize the utility of these 3-D semantic representations, we implement a cross-modal distillation strategy. This strategy encourages the RGB model's outputs to emulate those from the 3-D semantic feature network, enabling the concurrent training of both branches to merge RGB and 3-D semantic data effectively. Comprehensive evaluations on both the R2R and R4R datasets reveal that our method significantly enhances performance in VLN tasks.
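The two mechanisms the abstract names can be made concrete with a short sketch. The snippet below is an illustrative reconstruction, not the authors' released code: the module names, feature dimension, action count, temperature, and the KL-based distillation loss are all assumptions, and the paper's actual heads and objectives may differ. It shows a region-query pretext head that predicts which objects are present in a queried 3-D region, and a cross-modal distillation term that pulls the RGB branch's action distribution toward the (detached) 3-D semantic branch.

```python
# Illustrative sketch only: assumed names, shapes, and loss forms,
# not the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionQueryHead(nn.Module):
    """Pretext head: given pooled features of a queried 3-D region,
    predict which object classes are present there (multi-label)."""

    def __init__(self, feat_dim: int = 512, num_classes: int = 40):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, region_feat: torch.Tensor, presence_labels: torch.Tensor):
        logits = self.classifier(region_feat)
        # Binary cross-entropy over per-class presence in the queried region.
        loss = F.binary_cross_entropy_with_logits(logits, presence_labels)
        return logits, loss


class CrossModalDistillation(nn.Module):
    """Distillation term: the RGB branch's action distribution is trained
    to mimic the 3-D semantic branch's, so both branches can be trained
    concurrently while fusing the two modalities."""

    def __init__(self, feat_dim: int = 512, num_actions: int = 6,
                 temperature: float = 2.0):
        super().__init__()
        self.temperature = temperature
        self.rgb_head = nn.Linear(feat_dim, num_actions)
        self.sem3d_head = nn.Linear(feat_dim, num_actions)

    def forward(self, rgb_feat: torch.Tensor, sem3d_feat: torch.Tensor):
        t = self.temperature
        rgb_logits = self.rgb_head(rgb_feat)
        sem3d_logits = self.sem3d_head(sem3d_feat)
        # Detach the 3-D semantic "teacher" so the KL term only updates
        # the RGB "student"; scaling by t*t keeps gradients stable.
        student = F.log_softmax(rgb_logits / t, dim=-1)
        teacher = F.softmax(sem3d_logits.detach() / t, dim=-1)
        distill = F.kl_div(student, teacher, reduction="batchmean") * (t * t)
        return rgb_logits, sem3d_logits, distill
```

In a training step, these terms would be summed with the navigation loss under weights of one's own choosing; the abstract does not report the actual weighting.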


Similar Articles

1. Self-Supervised 3-D Semantic Representation Learning for Vision-and-Language Navigation.
IEEE Trans Neural Netw Learn Syst. 2025 Apr;36(4):6738-6751. doi: 10.1109/TNNLS.2024.3395633. Epub 2025 Apr 4.
2. PanoGen++: Domain-adapted text-guided panoramic environment generation for vision-and-language navigation.
Neural Netw. 2025 Jul;187:107320. doi: 10.1016/j.neunet.2025.107320. Epub 2025 Mar 10.
3. Correctable Landmark Discovery via Large Models for Vision-Language Navigation.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8534-8548. doi: 10.1109/TPAMI.2024.3407759. Epub 2024 Nov 6.
4. DICCR: Double-gated intervention and confounder causal reasoning for vision-language navigation.
Neural Netw. 2025 Apr;184:107078. doi: 10.1016/j.neunet.2024.107078. Epub 2024 Dec 30.
5. Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12535-12549. doi: 10.1109/TPAMI.2023.3273594. Epub 2023 Sep 5.
6. HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation.
IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):8524-8537. doi: 10.1109/TPAMI.2023.3234243. Epub 2023 Jun 5.
7. RDCRNet: RGB-T Object Detection Network Based on Cross-Modal Representation Model.
Entropy (Basel). 2025 Apr 19;27(4):442. doi: 10.3390/e27040442.
8. NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning.
IEEE Trans Pattern Anal Mach Intell. 2025 Jul;47(7):5945-5957. doi: 10.1109/TPAMI.2025.3554559.
9. MiLNet: Multiplex Interactive Learning Network for RGB-T Semantic Segmentation.
IEEE Trans Image Process. 2025;34:1686-1699. doi: 10.1109/TIP.2025.3544484. Epub 2025 Mar 11.
10. A Depth Awareness and Learnable Feature Fusion Network for Enhanced Geometric Perception in Semantic Correspondence.
Sensors (Basel). 2024 Oct 17;24(20):6680. doi: 10.3390/s24206680.