VideoA11y: Method and Dataset for Accessible Video Description.

Authors

Li Chaoyu, Padmanabhuni Sid, Cheema Maryam S, Seifi Hasti, Fazli Pooyan

Affiliations

School of Computing and Augmented Intelligence, Arizona State University, Tempe, Arizona, USA.

School of Arts, Media and Engineering, Arizona State University, Tempe, Arizona, USA.

Publication Information

Proc SIGCHI Conf Hum Factor Comput Syst. 2025 Apr-May;2025. doi: 10.1145/3706598.3714096. Epub 2025 Apr 25.

Abstract

Video descriptions are crucial for blind and low vision (BLV) users to access visual content. However, current artificial intelligence models for generating descriptions often fall short due to limitations in the quality of human annotations within training datasets, resulting in descriptions that do not fully meet BLV users' needs. To address this gap, we introduce VideoA11y, an approach that leverages multimodal large language models (MLLMs) and video accessibility guidelines to generate descriptions tailored for BLV individuals. Using this method, we have curated VideoA11y-40K, the largest and most comprehensive dataset of 40,000 videos described for BLV users. Rigorous experiments across 15 video categories, involving 347 sighted participants, 40 BLV participants, and seven professional describers, showed that VideoA11y descriptions outperform novice human annotations and are comparable to trained human annotations in clarity, accuracy, objectivity, descriptiveness, and user satisfaction. We evaluated models on VideoA11y-40K using both standard and custom metrics, demonstrating that MLLMs fine-tuned on this dataset produce high-quality accessible descriptions. Code and dataset are available at https://people-robots.github.io/VideoA11y/.
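To make the abstract's method concrete, the following is a minimal Python sketch of that kind of pipeline: uniformly sample frames from a video, then prompt a multimodal model with the frames plus a list of accessibility guidelines. The OpenAI client, the gpt-4o model, the GUIDELINES entries, and the function names are illustrative assumptions, not the paper's actual prompts, guideline set, or implementation.

# Minimal sketch: prompt a multimodal LLM with sampled video frames plus
# accessibility guidelines to draft a description for BLV users.
# Assumptions: the OpenAI Python client as a stand-in MLLM backend and a
# hypothetical GUIDELINES list; the paper's actual setup may differ.
import base64
import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai

# Hypothetical examples in the spirit of video accessibility guidelines;
# not the paper's actual guideline set.
GUIDELINES = [
    "Describe the visual content objectively; do not speculate.",
    "Name the people, objects, and actions relevant to the scene.",
    "Read any on-screen text verbatim.",
]

def sample_frames(video_path: str, num_frames: int = 8) -> list[str]:
    """Uniformly sample frames and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if not ok:
            continue
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(base64.b64encode(buf).decode("utf-8"))
    cap.release()
    return frames

def describe_video(video_path: str) -> str:
    """Ask the MLLM for a description conditioned on the guidelines."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        "You write video descriptions for blind and low vision users.\n"
        "Follow these guidelines:\n"
        + "\n".join(f"- {g}" for g in GUIDELINES)
        + "\nDescribe the video shown in these frames."
    )
    content = [{"type": "text", "text": prompt}] + [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
        for b64 in sample_frames(video_path)
    ]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(describe_video("example.mp4"))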

Similar Articles

1. VideoA11y: Method and Dataset for Accessible Video Description. Proc SIGCHI Conf Hum Factor Comput Syst. 2025 Apr-May;2025. doi: 10.1145/3706598.3714096. Epub 2025 Apr 25.

6. Shared decision-making interventions for people with mental health conditions. Cochrane Database Syst Rev. 2022 Nov 11;11(11):CD007297. doi: 10.1002/14651858.CD007297.pub3.
