VideoA11y: Method and Dataset for Accessible Video Description.

Authors

Li Chaoyu, Padmanabhuni Sid, Cheema Maryam S, Seifi Hasti, Fazli Pooyan

Affiliations

School of Computing and Augmented Intelligence, Arizona State University, Tempe, Arizona, USA.

School of Arts, Media and Engineering, Arizona State University, Tempe, Arizona, USA.

Publication information

Proc SIGCHI Conf Hum Factor Comput Syst. 2025 Apr-May;2025. doi: 10.1145/3706598.3714096. Epub 2025 Apr 25.

DOI:10.1145/3706598.3714096

PMID:40894856

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12398407/

Abstract

Video descriptions are crucial for blind and low vision (BLV) users to access visual content. However, current artificial intelligence models for generating descriptions often fall short due to limitations in the quality of human annotations within training datasets, resulting in descriptions that do not fully meet BLV users' needs. To address this gap, we introduce VideoA11y, an approach that leverages multimodal large language models (MLLMs) and video accessibility guidelines to generate descriptions tailored for BLV individuals. Using this method, we have curated VideoA11y-40K, the largest and most comprehensive dataset of 40,000 videos described for BLV users. Rigorous experiments across 15 video categories, involving 347 sighted participants, 40 BLV participants, and seven professional describers, showed that VideoA11y descriptions outperform novice human annotations and are comparable to trained human annotations in clarity, accuracy, objectivity, descriptiveness, and user satisfaction. We evaluated models on VideoA11y-40K using both standard and custom metrics, demonstrating that MLLMs fine-tuned on this dataset produce high-quality accessible descriptions. Code and dataset are available at https://people-robots.github.io/VideoA11y/.
