School of Information Science & Engineering, Shandong Normal University, Jinan, China.
College of Intelligence and Information Engineering, Shandong University of Traditional Chinese Medicine, Jinan, China.
PLoS One. 2022 Oct 6;17(10):e0275156. doi: 10.1371/journal.pone.0275156. eCollection 2022.
Video question answering (Video-QA) is a subject of intense study in Artificial Intelligence, as it is one of the tasks that can evaluate an AI system's abilities. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS addresses the task of answering multiple-choice questions over a video-subtitle-QA representation by fusing attention and self-attention between the modalities. We use BERT to extract text features and Faster R-CNN to extract visual features, providing a useful input representation for our model to answer questions. In addition, we construct a Modality Attention Fusion (MAF) framework that builds an attention fusion matrix from the different modalities (video, subtitles, QA), and apply Hybrid Multi-head Self-attention (HMS) to further determine the correct answer. Experiments on three separate scene datasets show that our overall model outperforms the baseline methods by a large margin. Finally, we conducted extensive ablation studies to verify the various components of the network, and we demonstrate the effectiveness and advantages of our method over existing methods through experiments broken down by question type and required modality.
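The fusion pipeline the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the feature dimensions, head count, and the specific fusion scheme (letting QA features attend to video and subtitle features, then applying multi-head self-attention over the concatenation) are illustrative assumptions, with random arrays standing in for BERT and Faster R-CNN features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(q_feats, kv_feats):
    # Scaled dot-product attention of one modality (queries)
    # over another modality (keys/values).
    scores = q_feats @ kv_feats.T / np.sqrt(q_feats.shape[-1])
    return softmax(scores, axis=-1) @ kv_feats

def multi_head_self_attention(x, num_heads=4):
    # Split the feature dimension into heads, self-attend within
    # each head, and concatenate the results.
    n, d = x.shape
    assert d % num_heads == 0
    dh = d // num_heads
    heads = []
    for h in range(num_heads):
        xh = x[:, h * dh:(h + 1) * dh]
        scores = xh @ xh.T / np.sqrt(dh)
        heads.append(softmax(scores, axis=-1) @ xh)
    return np.concatenate(heads, axis=1)

# Toy features standing in for Faster R-CNN region features (video)
# and BERT token features (subtitles, QA); dimensions are arbitrary.
rng = np.random.default_rng(0)
video = rng.normal(size=(6, 8))     # 6 visual region features
subtitle = rng.normal(size=(4, 8))  # 4 subtitle token features
qa = rng.normal(size=(5, 8))        # 5 question-answer token features

# Fuse: QA attends to video and subtitles; the fused representation
# is then refined by multi-head self-attention.
fused = np.concatenate([qa,
                        cross_modal_attention(qa, video),
                        cross_modal_attention(qa, subtitle)], axis=0)
out = multi_head_self_attention(fused, num_heads=4)
print(out.shape)  # (15, 8)
```

In a full model, the output would be pooled and scored per candidate answer to pick the correct choice; here the sketch only shows how cross-modal attention and multi-head self-attention compose.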