Shi Weimin, Chen Changhao, Li Kaige, Xiong Yuan, Cao Xiaochun, Zhou Zhong
IEEE Trans Image Process. 2025;34:1737-1752. doi: 10.1109/TIP.2025.3546853. Epub 2025 Mar 20.
Existing localization methods commonly employ vision to perceive the scene and achieve localization in GNSS-denied areas, yet they often struggle in environments with complex lighting, dynamic objects, or privacy-sensitive areas. Humans can describe diverse scenes in natural language and infer their location from the rich semantic information in those descriptions, so harnessing language offers a potential route to robust localization. This study therefore introduces a new task, Language-driven Localization, and proposes a novel localization framework, LangLoc, which determines the user's position and orientation from textual descriptions. Given the diversity of natural language descriptions, we first design a Spatial Description Generator (SDG), the foundation of LangLoc, which extracts and combines the position and attribute information of objects in a scene to generate uniformly formatted textual descriptions. SDG removes linguistic ambiguity by detailing the scene's spatial layout and object relations, providing a reliable basis for localization. With the generated descriptions, LangLoc achieves language-only localization using a text encoder and a pose regressor. Furthermore, LangLoc can augment the text input with a single image, achieving mutual optimization and adaptive feature fusion across modalities through two modality-specific encoders, cross-modal fusion, and a multimodal joint learning strategy. This strengthens the framework's ability to handle complex scenes and yields more accurate localization. Extensive experiments on the Oxford RobotCar, 4-Seasons, and Virtual Gallery datasets demonstrate LangLoc's effectiveness in both language-only and visual-language localization across various outdoor and indoor scenarios. Notably, LangLoc achieves marked performance gains when using both text and image inputs under challenging conditions such as overexposure, low lighting, and occlusion, showcasing its superior robustness.
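To make the "text encoder plus pose regressor" pipeline concrete, below is a minimal PyTorch sketch of a language-only pose regressor in the spirit of the abstract. It encodes an already-tokenized SDG-style scene description with a small transformer and regresses a 6-DoF pose (3-D position plus a unit-quaternion orientation). All module names, dimensions, and the tokenization scheme are illustrative assumptions for this sketch, not the authors' published implementation.

```python
# Hypothetical sketch of a text-to-pose regressor; architecture details are
# assumptions, not the LangLoc reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextPoseRegressor(nn.Module):
    """Encode a uniformly formatted scene description and regress a pose:
    a 3-D position (x, y, z) and a 4-D unit-quaternion orientation."""
    def __init__(self, vocab_size=30522, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.pos_head = nn.Linear(d_model, 3)   # position head
        self.rot_head = nn.Linear(d_model, 4)   # orientation head

    def forward(self, token_ids, attn_mask=None):
        pad_mask = (attn_mask == 0) if attn_mask is not None else None
        x = self.encoder(self.embed(token_ids), src_key_padding_mask=pad_mask)
        feat = x.mean(dim=1)                     # mean-pool token features
        pos = self.pos_head(feat)
        rot = F.normalize(self.rot_head(feat), dim=-1)  # force unit quaternion
        return pos, rot

# Usage with dummy token ids standing in for tokenized SDG descriptions:
model = TextPoseRegressor()
tokens = torch.randint(0, 30522, (2, 64))        # batch of 2 descriptions
pos, rot = model(tokens)
print(pos.shape, rot.shape)                      # (2, 3) and (2, 4)
```

The visual-language variant described in the abstract would pair such a text encoder with an image encoder and fuse the two feature streams before the regression heads; that fusion is omitted here to keep the sketch minimal.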