Winterbottom Thomas, Xiao Sarah, McLean Alistair, Al Moubayed Noura
Department of Computer Science, Durham University, Durham, United Kingdom.
Durham University Business School, Durham University, Durham, United Kingdom.
PeerJ Comput Sci. 2022 Jun 3;8:e974. doi: 10.7717/peerj-cs.974. eCollection 2022.
Bilinear pooling (BLP) refers to a family of operations recently developed for fusing features from different modalities, predominantly for visual question answering (VQA) models. Successive BLP techniques have yielded higher performance at lower computational expense, yet at the same time they have drifted further from the original motivational justification of bilinear models, becoming empirically motivated by task performance instead. Furthermore, despite significant success in text-image fusion for VQA, BLP has not yet gained such prominence in video question answering (video-QA). Though BLP methods have continued to perform well on video tasks when fusing vision and non-textual features, BLP has recently been overshadowed in video-QA by other techniques for fusing visual and textual features. We aim to add a new perspective on this empirical and motivational drift in BLP. We take a step back and discuss the motivational origins of BLP, highlighting the often-overlooked parallels to neurological theories (Dual Coding Theory and the Two-Stream Model of Vision). We seek to carefully and experimentally ascertain the empirical strengths and limitations of BLP as a multimodal text-vision fusion technique in video-QA using two models (the TVQA baseline and the heterogeneous-memory-enhanced 'HME' model) and four datasets (TVQA, TGIF-QA, MSVD-QA, and EgoVQA). We examine the impact both of simply replacing feature concatenation in the existing models with BLP, and of a modified version of the TVQA baseline, which we name the 'dual-stream' model, designed to accommodate BLP. We find that our relatively simple integration of BLP does not increase, and mostly harms, performance on these video-QA benchmarks. Drawing on our results, recent work in BLP for video-QA, and recently proposed theoretical multimodal fusion taxonomies, we offer insight into why BLP-driven performance gains may be harder to achieve on video-QA benchmarks than in earlier VQA models.
We share our perspective on, and suggest solutions for, the key issues we identify with BLP techniques for multimodal fusion in video-QA. Looking beyond the empirical justification of BLP techniques, we propose both alternatives and improvements to multimodal fusion, drawing neurological inspiration from Dual Coding Theory and the Two-Stream Model of Vision. We qualitatively highlight the potential of neurological inspiration in video-QA by identifying the relative abundance of psycholinguistically 'concrete' words in the vocabularies of each text component (questions and answers) of the four video-QA datasets we experiment with.
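To make the fusion operation under discussion concrete, the following is a minimal NumPy sketch of low-rank bilinear pooling in the style of factorized BLP methods such as MFB: each modality is projected, the projections are fused by an elementwise product (a low-rank factorization of the full bilinear outer product), sum-pooled, then passed through the customary signed square root and l2 normalization. All dimensions, weights, and inputs here are arbitrary placeholders for illustration, not the configuration used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimensions: text feature size, visual feature size,
# factorization rank, and fused output size (illustrative values only).
d_text, d_vis, rank, out_dim = 300, 2048, 5, 1000

# Randomly initialized projection matrices stand in for learned weights.
U = rng.standard_normal((d_text, rank * out_dim)) * 0.01  # text projection
V = rng.standard_normal((d_vis, rank * out_dim)) * 0.01   # visual projection

def bilinear_pool(x, y):
    # Project each modality and fuse with an elementwise product:
    # a low-rank factorization of the full bilinear form x^T W y.
    joint = (x @ U) * (y @ V)
    # Sum-pool over the rank dimension to obtain the fused feature.
    pooled = joint.reshape(out_dim, rank).sum(axis=1)
    # Signed square root and l2 normalization, standard in BLP pipelines.
    pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))
    return pooled / (np.linalg.norm(pooled) + 1e-12)

fused = bilinear_pool(rng.standard_normal(d_text), rng.standard_normal(d_vis))
print(fused.shape)  # (1000,)
```

In VQA/video-QA models this fused vector would replace the concatenation of text and visual features before the answer classifier, which is precisely the substitution the experiments above evaluate.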