Hao Yu, Yang Fan, Huang Hao, Yuan Shuaihang, Rangan Sundeep, Rizzo John-Ross, Wang Yao, Fang Yi
Tandon School of Engineering, New York University, Brooklyn, NY 11201, USA.
NYU Langone Health, New York University, New York, NY 10016, USA.
J Imaging. 2024 Apr 26;10(5):103. doi: 10.3390/jimaging10050103.
People with blindness and low vision (pBLV) encounter substantial challenges when it comes to comprehensive scene recognition and precise object identification in unfamiliar environments. Additionally, due to their vision loss, pBLV have difficulty independently identifying potential tripping hazards. Previous assistive technologies for the visually impaired often struggle in real-world scenarios because they require constant retraining and lack robustness, which limits their effectiveness, especially in dynamic and unfamiliar environments where accurate and efficient perception is crucial. We therefore hypothesize that, by leveraging large pretrained foundation models and prompt engineering, we can create a system that effectively addresses the challenges faced by pBLV in unfamiliar environments. Motivated by the growing adoption of large pretrained foundation models, particularly in assistive robotics, for the accurate perception and robust contextual understanding that extensive pretraining provides, we present a pioneering approach that leverages foundation models to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and providing warnings about potential risks. Specifically, our method begins by leveraging a large image tagging model (i.e., the Recognize Anything Model (RAM)) to identify all common objects present in the captured images. The recognition results and the user query are then integrated into a prompt, tailored specifically for pBLV, using prompt engineering. By combining the prompt and the input image, a vision-language foundation model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the environment and identifies potential risks by analyzing environmental objects and scene landmarks relevant to the prompt. We evaluate our approach through experiments conducted on both indoor and outdoor datasets. Our results demonstrate that our method can recognize objects accurately and provide insightful descriptions and analysis of the environment for pBLV.
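The tag-then-prompt pipeline summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: get_image_tags is a hypothetical stand-in for the Recognize Anything Model, the prompt template is assumed, and the InstructBLIP step uses the Hugging Face transformers interface.

```python
# Minimal sketch of the tag -> prompt -> vision-language pipeline described in the
# abstract. Assumptions: get_image_tags() is a hypothetical stand-in for the
# Recognize Anything Model (RAM); the prompt template is illustrative; InstructBLIP
# is loaded through the Hugging Face transformers API.
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration


def get_image_tags(image: Image.Image) -> list[str]:
    """Hypothetical placeholder for RAM image tagging; returns common objects found."""
    raise NotImplementedError("Replace with a call to the Recognize Anything Model.")


def build_prompt(tags: list[str], user_query: str) -> str:
    """Combine recognized objects and the user query into a pBLV-oriented prompt (assumed template)."""
    return (
        f"The image contains: {', '.join(tags)}. "
        f"For a person with blindness or low vision, {user_query} "
        "Describe the surroundings in detail and warn about any tripping hazards."
    )


def describe_scene(image_path: str, user_query: str) -> str:
    image = Image.open(image_path).convert("RGB")
    tags = get_image_tags(image)                 # step 1: object tagging (RAM)
    prompt = build_prompt(tags, user_query)      # step 2: prompt engineering

    # Step 3: vision-language reasoning with InstructBLIP.
    processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
    model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b")
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```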