Modality attention fusion model with hybrid multi-head self-attention for video understanding.

Affiliations

School of Information Science & Engineering, Shandong Normal University, Jinan, China.

College of Intelligence and Information Engineering, Shandong University of Traditional Chinese Medicine, Jinan, China.

Publication information

PLoS One. 2022 Oct 6;17(10):e0275156. doi: 10.1371/journal.pone.0275156. eCollection 2022.

DOI: 10.1371/journal.pone.0275156
PMID: 36201513
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9536548/
Abstract

Video question answering (Video-QA) is an intensely studied topic in artificial intelligence and one of the tasks used to evaluate AI capabilities. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS addresses multiple-choice question answering over a video-subtitle-QA representation by fusing attention and self-attention across modalities. We use BERT to extract text features and Faster R-CNN to extract visual features, which together provide a useful input representation for the model. In addition, we construct a Modality Attention Fusion (MAF) framework that builds the attention fusion matrix from the different modalities (video, subtitles, QA), and use Hybrid Multi-head Self-attention (HMS) to further determine the correct answer. Experiments on three separate scene datasets show that our overall model outperforms the baseline methods by a large margin. Finally, we conduct extensive ablation studies to verify the various components of the network, and results broken down by question type and required modality demonstrate the effectiveness and advantages of our method over existing methods.
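
The abstract names two building blocks: attention between modalities (video, subtitles, QA) and multi-head self-attention over the fused result. The NumPy sketch below illustrates that general pattern only; it is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the scaled dot-product form of the cross-modal attention, the concatenation used for fusion, the max-pooling and linear scoring step, and the random arrays that stand in for BERT and Faster R-CNN features. The abstract does not say what makes the self-attention "hybrid", so the sketch uses plain multi-head self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats):
    # Attend from one modality (query) to another (context).
    # query_feats: (n_q, d), context_feats: (n_c, d) -> (n_q, d)
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context_feats

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # Standard multi-head self-attention over a sequence x of shape (n, d).
    n, d = x.shape
    d_head = d // num_heads

    def split(t):
        # (n, d) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores, axis=-1) @ v
    merged = heads.transpose(1, 0, 2).reshape(n, d)
    return merged @ w_o

def score_answers(video, subtitles, qa_candidates, params, num_heads=4):
    # Hypothetical scoring loop: one score per multiple-choice candidate.
    scores = []
    for qa in qa_candidates:
        qa_video = cross_attention(qa, video)      # QA attends to video
        qa_sub = cross_attention(qa, subtitles)    # QA attends to subtitles
        # Assumed fusion: stack the three views along the sequence axis.
        fused = np.concatenate([qa, qa_video, qa_sub], axis=0)
        refined = multi_head_self_attention(fused, *params["attn"], num_heads)
        pooled = refined.max(axis=0)               # assumed pooling step
        scores.append(pooled @ params["w_score"])
    return softmax(np.array(scores))

# Random stand-ins for extracted features (not real BERT / Faster R-CNN output).
rng = np.random.default_rng(0)
d = 16
params = {
    "attn": [rng.normal(scale=0.1, size=(d, d)) for _ in range(4)],
    "w_score": rng.normal(scale=0.1, size=d),
}
video = rng.normal(size=(20, d))        # 20 visual region features
subtitles = rng.normal(size=(30, d))    # 30 subtitle token features
qa_candidates = [rng.normal(size=(12, d)) for _ in range(5)]

probs = score_answers(video, subtitles, qa_candidates, params)
predicted = int(np.argmax(probs))
print(probs.shape, predicted)
```

With untrained random weights the predicted index is meaningless; the example only shows how features from the three modalities could flow through attention fusion and self-attention to produce one probability per answer candidate.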


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9f7/9536548/1988a8a1765b/pone.0275156.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9f7/9536548/a1627407db13/pone.0275156.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9f7/9536548/34fbe9e6a52f/pone.0275156.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9f7/9536548/10426a8c2153/pone.0275156.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9f7/9536548/7d0e913d80cc/pone.0275156.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f9f7/9536548/b0e5fbbb1aed/pone.0275156.g006.jpg

Cited by

1
Scene-dependent sound event detection based on multitask learning with deformable large kernel attention convolution.
PLoS One. 2025 May 9;20(5):e0322002. doi: 10.1371/journal.pone.0322002. eCollection 2025.

References

1
Holistic Multi-modal Memory Network for Movie Question Answering.
IEEE Trans Image Process. 2019 Aug 2. doi: 10.1109/TIP.2019.2931534.
2
Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks.
IEEE Trans Image Process. 2019 Dec;28(12):5939-5952. doi: 10.1109/TIP.2019.2922062. Epub 2019 Jun 17.
3
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
IEEE Trans Pattern Anal Mach Intell. 2017 Jun;39(6):1137-1149. doi: 10.1109/TPAMI.2016.2577031. Epub 2016 Jun 6.