LangLoc: Language-Driven Localization via Formatted Spatial Description Generation.

Authors

Shi Weimin, Chen Changhao, Li Kaige, Xiong Yuan, Cao Xiaochun, Zhou Zhong

Publication

IEEE Trans Image Process. 2025;34:1737-1752. doi: 10.1109/TIP.2025.3546853. Epub 2025 Mar 20.

DOI: 10.1109/TIP.2025.3546853
PMID: 40067730
Abstract

Existing localization methods commonly employ vision to perceive the scene and achieve localization in GNSS-denied areas, yet they often struggle in environments with complex lighting conditions, dynamic objects or privacy-preserving areas. Humans possess the ability to describe various scenes using natural language, effectively inferring their location by leveraging the rich semantic information in these descriptions. Harnessing language presents a potential solution for robust localization. Thus, this study introduces a new task, Language-driven Localization, and proposes a novel localization framework, LangLoc, which determines the user's position and orientation through textual descriptions. Given the diversity of natural language descriptions, we first design a Spatial Description Generator (SDG), foundational to LangLoc, which extracts and combines the position and attribute information of objects within a scene to generate uniformly formatted textual descriptions. SDG eliminates the ambiguity of language, detailing the spatial layout and object relations of the scene, providing a reliable basis for localization. With the generated descriptions, LangLoc effortlessly achieves language-only localization using a text encoder and a pose regressor. Furthermore, LangLoc can add one image to the text input, achieving mutual optimization and feature adaptive fusion across modalities through two modality-specific encoders, cross-modal fusion, and multimodal joint learning strategies. This enhances the framework's capability to handle complex scenes, achieving more accurate localization. Extensive experiments on the Oxford RobotCar, 4-Seasons, and Virtual Gallery datasets demonstrate LangLoc's effectiveness in both language-only and visual-language localization across various outdoor and indoor scenarios. Notably, LangLoc achieves noticeable performance gains when using both text and image inputs in challenging conditions such as overexposure, low lighting, and occlusions, showcasing its superior robustness.
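
The abstract describes the framework only at a high level: formatted descriptions from the SDG feed a text encoder and a pose regressor, and an optional image branch is fused with the text features for the visual-language variant. As a rough illustration of that data flow, here is a minimal sketch in PyTorch; every module name, layer size, the mean-pooling, and the concatenation-based fusion are assumptions made for this example, not the paper's actual implementation.

```python
# Minimal sketch of a LangLoc-style pipeline, assuming PyTorch.
# All module names, sizes, and the concatenation-based fusion are hypothetical,
# chosen only to mirror the abstract's description; this is not the paper's code.
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Encodes a formatted spatial description (token ids) into one feature vector."""
    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):                  # tokens: (B, T) int64
        x = self.encoder(self.embed(tokens))    # (B, T, dim)
        return x.mean(dim=1)                    # simple pooling -> (B, dim)


class ImageEncoder(nn.Module):
    """Encodes an RGB image into one feature vector (stand-in for a real backbone)."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, img):                     # img: (B, 3, H, W)
        return self.backbone(img)               # (B, dim)


class PoseRegressor(nn.Module):
    """Maps a feature vector to a 3-D position and a unit-quaternion orientation."""
    def __init__(self, dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 7))

    def forward(self, feat):
        out = self.head(feat)
        pos, quat = out[:, :3], out[:, 3:]
        return pos, nn.functional.normalize(quat, dim=-1)


class LangLocSketch(nn.Module):
    """Text-only localization, with an optional image branch fused by concatenation."""
    def __init__(self, dim=256):
        super().__init__()
        self.text_enc = TextEncoder(dim=dim)
        self.image_enc = ImageEncoder(dim=dim)
        self.fuse = nn.Linear(2 * dim, dim)     # crude stand-in for cross-modal fusion
        self.regressor = PoseRegressor(dim=dim)

    def forward(self, tokens, image=None):
        feat = self.text_enc(tokens)
        if image is not None:                   # visual-language variant
            feat = self.fuse(torch.cat([feat, self.image_enc(image)], dim=-1))
        return self.regressor(feat)


# Toy usage: a batch of two 20-token descriptions, optionally paired with images.
tokens = torch.randint(0, 10000, (2, 20))
image = torch.rand(2, 3, 128, 128)
model = LangLocSketch()
pos, quat = model(tokens)                       # language-only localization
pos, quat = model(tokens, image)                # visual-language localization
```

Calling the model with tokens alone corresponds to the language-only mode, while passing an image as well corresponds to the visual-language mode; the paper's cross-modal fusion and multimodal joint-learning strategies are richer than the single linear fusion layer used in this sketch.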

Similar Articles

1. LangLoc: Language-Driven Localization via Formatted Spatial Description Generation. IEEE Trans Image Process. 2025;34:1737-1752. doi: 10.1109/TIP.2025.3546853. Epub 2025 Mar 20.
2. Unambiguous Scene Text Segmentation with Referring Expression Comprehension. IEEE Trans Image Process. 2019 Jul 26. doi: 10.1109/TIP.2019.2930176.
3. CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model. Sensors (Basel). 2024 Nov 19;24(22):7371. doi: 10.3390/s24227371.
4. A Multi-Modal Foundation Model to Assist People with Blindness and Low Vision in Environmental Interaction. J Imaging. 2024 Apr 26;10(5):103. doi: 10.3390/jimaging10050103.
5. Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study. JMIR Form Res. 2024 Feb 8;8:e32690. doi: 10.2196/32690.
6. Advanced Monocular Outdoor Pose Estimation in Autonomous Systems: Leveraging Optical Flow, Depth Estimation, and Semantic Segmentation with Dynamic Object Removal. Sensors (Basel). 2024 Dec 17;24(24):8040. doi: 10.3390/s24248040.
7. Indoor Scene Recognition Mechanism Based on Direction-Driven Convolutional Neural Networks. Sensors (Basel). 2023 Jun 17;23(12):5672. doi: 10.3390/s23125672.
8. Bidirectional Relationship Inferring Network for Referring Image Localization and Segmentation. IEEE Trans Neural Netw Learn Syst. 2023 May;34(5):2246-2258. doi: 10.1109/TNNLS.2021.3106153. Epub 2023 May 2.
9. Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding. IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8517-8533. doi: 10.1109/TPAMI.2024.3410324. Epub 2024 Nov 6.
10. MMAgentRec, a personalized multi-modal recommendation agent with large language model. Sci Rep. 2025 Apr 8;15(1):12062. doi: 10.1038/s41598-025-96458-w.