Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
Sci Data. 2023 Mar 22;10(1):158. doi: 10.1038/s41597-023-02036-y.
This paper introduces a new challenge and new datasets to foster research toward designing systems that can understand medical videos and provide visual answers to natural language questions. We believe medical videos may provide the best possible answers to many first aid, medical emergency, and medical education questions. Toward this end, we created the MedVidCL and MedVidQA datasets and introduce the tasks of Medical Video Classification (MVC) and Medical Visual Answer Localization (MVAL), two tasks that focus on cross-modal (medical language and medical video) understanding. The proposed tasks and datasets have the potential to support the development of sophisticated downstream applications that can benefit the public and medical practitioners. Our datasets consist of 6,117 fine-grained annotated videos for the MVC task and 3,010 questions with answer timestamps drawn from 899 videos for the MVAL task. These datasets have been verified and corrected by medical informatics experts. We have also benchmarked each task with the created MedVidCL and MedVidQA datasets and propose multimodal learning methods that set competitive baselines for future research.
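For readers unfamiliar with the MVAL setting, the sketch below illustrates how a predicted answer span can be compared against an annotated timestamp pair. It assumes temporal Intersection-over-Union (IoU), the standard metric in moment-localization research; the abstract itself does not specify an evaluation metric, and all span values here are hypothetical.

# Minimal sketch of MVAL-style scoring: temporal IoU between a predicted
# answer span and the annotated gold span, both given as (start, end) in
# seconds. The metric choice is an assumption based on common practice in
# moment localization, not a detail stated in the abstract.

def temporal_iou(pred, gold):
    """Return the IoU of two (start, end) time spans."""
    inter_start = max(pred[0], gold[0])
    inter_end = min(pred[1], gold[1])
    intersection = max(0.0, inter_end - inter_start)
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0

# Hypothetical example: a system predicts the visual answer spans 12 s to
# 34 s, while annotators marked 15 s to 30 s.
print(temporal_iou((12.0, 34.0), (15.0, 30.0)))  # 15/22, about 0.68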