
Similar Articles

1. VideoA11y: Method and Dataset for Accessible Video Description.
Proc SIGCHI Conf Hum Factor Comput Syst. 2025 Apr-May;2025. doi: 10.1145/3706598.3714096. Epub 2025 Apr 25.
2. VIIDA and InViDe: computational approaches for generating and evaluating inclusive image paragraphs for the visually impaired.
Disabil Rehabil Assist Technol. 2025 Jul;20(5):1470-1495. doi: 10.1080/17483107.2024.2437567. Epub 2024 Dec 11.
3. Prescription of Controlled Substances: Benefits and Risks.
4. Exploring the use of smartphone applications during navigation-based tasks for individuals who are blind or who have low vision: future directions and priorities.
Disabil Rehabil Assist Technol. 2025 Aug 25:1-29. doi: 10.1080/17483107.2025.2544942.
5. Development of a personalized conversational health agent to enhance physical activity for blind and low-vision individuals.
Mhealth. 2025 Jul 10;11:29. doi: 10.21037/mhealth-24-60. eCollection 2025.
6. Shared decision-making interventions for people with mental health conditions.
Cochrane Database Syst Rev. 2022 Nov 11;11(11):CD007297. doi: 10.1002/14651858.CD007297.pub3.
7. Developing an ICD-10 Coding Assistant: Pilot Study Using RoBERTa and GPT-4 for Term Extraction and Description-Based Code Selection.
JMIR Form Res. 2025 Feb 11;9:e60095. doi: 10.2196/60095.
8. Watch and learn: leveraging expert knowledge and language for surgical video understanding.
Int J Comput Assist Radiol Surg. 2025 Jul 2. doi: 10.1007/s11548-025-03472-4.
9. Leveraging multimodal large language model for multimodal sequential recommendation.
Sci Rep. 2025 Aug 7;15(1):28960. doi: 10.1038/s41598-025-14251-1.
10. How well do multimodal LLMs interpret CT scans? An auto-evaluation framework for analyses.
J Biomed Inform. 2025 Aug;168:104864. doi: 10.1016/j.jbi.2025.104864. Epub 2025 Jun 25.

Cited By

1. Describe Now: User-Driven Audio Description for Blind and Low Vision Individuals.
DIS (Des Interact Syst Conf). 2025 Jul;2025:458-474. doi: 10.1145/3715336.3735685. Epub 2025 Jul 4.

References

1. VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset.
IEEE Trans Pattern Anal Mach Intell. 2025 Feb;47(2):708-724. doi: 10.1109/TPAMI.2024.3479776. Epub 2025 Jan 9.
2. The Efficacy of Collaborative Authoring of Video Scene Descriptions.
ASSETS. 2021;17. doi: 10.1145/3441852.3471201.
3. Revision of visual impairment definitions in the International Statistical Classification of Diseases.
BMC Med. 2006 Mar 16;4:7. doi: 10.1186/1741-7015-4-7.


VideoA11y: Method and Dataset for Accessible Video Description.

Author Information

Li Chaoyu, Padmanabhuni Sid, Cheema Maryam S, Seifi Hasti, Fazli Pooyan

Affiliations

School of Computing and Augmented Intelligence, Arizona State University, Tempe, Arizona, USA.

School of Arts, Media and Engineering, Arizona State University, Tempe, Arizona, USA.

Publication Information

Proc SIGCHI Conf Hum Factor Comput Syst. 2025 Apr-May;2025. doi: 10.1145/3706598.3714096. Epub 2025 Apr 25.

DOI: 10.1145/3706598.3714096
PMID: 40894856
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12398407/
Abstract

Video descriptions are crucial for blind and low vision (BLV) users to access visual content. However, current artificial intelligence models for generating descriptions often fall short due to limitations in the quality of human annotations within training datasets, resulting in descriptions that do not fully meet BLV users' needs. To address this gap, we introduce VideoA11y, an approach that leverages multimodal large language models (MLLMs) and video accessibility guidelines to generate descriptions tailored for BLV individuals. Using this method, we have curated VideoA11y-40K, the largest and most comprehensive dataset of 40,000 videos described for BLV users. Rigorous experiments across 15 video categories, involving 347 sighted participants, 40 BLV participants, and seven professional describers, showed that VideoA11y descriptions outperform novice human annotations and are comparable to trained human annotations in clarity, accuracy, objectivity, descriptiveness, and user satisfaction. We evaluated models on VideoA11y-40K using both standard and custom metrics, demonstrating that MLLMs fine-tuned on this dataset produce high-quality accessible descriptions. Code and dataset are available at https://people-robots.github.io/VideoA11y/.
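
The abstract outlines the core recipe: sample a video, then prompt a multimodal LLM with video accessibility guidelines to produce a BLV-oriented description. The sketch below illustrates that flow under stated assumptions; it is not the paper's implementation. The guideline texts are paraphrased examples of the kind of rules such guidelines contain, and `query_mllm` is a hypothetical stand-in for whatever MLLM client is used, since the paper's exact prompt and model interface are not reproduced here.

```python
# Minimal illustrative sketch of a VideoA11y-style pipeline (assumptions noted):
# sample frames from a video, then prompt a multimodal LLM with accessibility
# guidelines. GUIDELINES is paraphrased, and query_mllm is a hypothetical stub.
import cv2  # pip install opencv-python

# Paraphrased examples of accessibility-guideline rules; the paper's actual
# guideline set may differ.
GUIDELINES = [
    "Describe the visual content essential to understanding the scene.",
    "Be objective: report what is shown, not interpretations.",
    "Name people, actions, objects, and any on-screen text.",
    "Keep the description clear and concise for screen-reader playback.",
]


def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample frames so the model sees the whole clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        # Seek to an evenly spaced position before reading each frame.
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // max(num_frames, 1))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


def query_mllm(prompt: str, frames) -> str:
    """Hypothetical placeholder: plug in any multimodal LLM client here."""
    raise NotImplementedError


def describe_video(video_path: str) -> str:
    frames = sample_frames(video_path)
    prompt = (
        "Write a video description for blind and low vision users. "
        "Follow these guidelines:\n- " + "\n- ".join(GUIDELINES)
    )
    return query_mllm(prompt, frames)
```

The key design point the abstract emphasizes is that the guidelines are injected into the prompt, so description quality is steered by accessibility rules rather than left to the model's defaults; the dataset (VideoA11y-40K) is then built from such guided outputs and used to fine-tune MLLMs.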
