Fernandes Daniel L, Ribeiro Marcos H F, Silva Michel M, Cerqueira Fabio R
Department of Informatics, Universidade Federal de Viçosa - UFV, Viçosa, Brazil.
Department of Production Engineering, Universidade Federal Fluminense - UFF, Petrópolis, Brazil.
Disabil Rehabil Assist Technol. 2025 Jul;20(5):1470-1495. doi: 10.1080/17483107.2024.2437567. Epub 2024 Dec 11.
Existing image description methods, when used as Assistive Technologies, often fall short of meeting the needs of blind or low vision (BLV) individuals. They tend to compress all visual elements into brief captions, generate disjointed sentences for each image region, or provide overly extensive descriptions.
To address these limitations, we introduce VIIDA, a procedure aimed at the Visually Impaired that Implements an Image Description Approach, focusing on webinar scenes. We also propose InViDe, an Inclusive Visual Description metric: a novel approach for evaluating image descriptions targeted at BLV people.
We reviewed existing methods and developed VIIDA by integrating a multimodal Visual Question Answering model with Natural Language Processing (NLP) filters. A scene graph-based algorithm was then applied to structure the final paragraphs. Employing NLP tools, InViDe conducts a multicriteria analysis based on accessibility standards and guidelines.
Experiments statistically demonstrate that VIIDA generates descriptions that closely align with image content and with human-written linguistic features, and that suit BLV needs. InViDe offers valuable insights into the behaviour of the compared methods - among them, state-of-the-art methods based on Large Language Models - across diverse criteria.
VIIDA and InViDe emerge as efficient Assistive Technologies, combining Artificial Intelligence models with computational and mathematical techniques to generate and evaluate image descriptions for the visually impaired at low computational cost. This work is anticipated to inspire further research and application development in the domain of Assistive Technologies. Our code is publicly available at: https://github.com/daniellf/VIIDA-and-InViDe.