VisQA: X-raying Vision and Language Reasoning in Transformers.

Publication information

IEEE Trans Vis Comput Graph. 2022 Jan;28(1):976-986. doi: 10.1109/TVCG.2021.3114683. Epub 2021 Dec 24.

Abstract

Visual Question Answering (VQA) systems aim to answer open-ended textual questions about input images. They serve as a testbed for learning high-level reasoning, with a primary application in HCI, for instance assistance for the visually impaired. Recent research has shown that state-of-the-art models tend to produce answers by exploiting biases and shortcuts in the training data, sometimes without even looking at the input image, instead of performing the required reasoning steps. We present VisQA, a visual analytics tool that explores this question of reasoning vs. bias exploitation. It exposes a key element of state-of-the-art neural models: the attention maps in transformers. Our working hypothesis is that the reasoning steps leading to model predictions are observable in attention distributions, which lend themselves particularly well to visualization. The design process of VisQA was motivated by well-known bias examples from the fields of deep learning and vision-language reasoning, and VisQA was evaluated in two ways. First, as the result of a collaboration between three fields (machine learning, vision and language reasoning, and data analytics), the work led to a better understanding of how neural models exploit bias in VQA. This understanding eventually influenced model design and training through a proposed method for transferring reasoning patterns from an oracle model. Second, we report on the design of VisQA and on a goal-oriented evaluation in which multiple experts used it to analyze a model's decision process, providing evidence that it makes the inner workings of models accessible to users.
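For readers unfamiliar with the term, an attention map can be illustrated with a minimal sketch, assuming standard scaled dot-product attention, softmax(QK^T / sqrt(d)), as used in transformers. Each row is one query token's probability distribution over the key tokens. The per-row entropy summary, the function names `attention_map` and `row_entropy`, and the toy query/key values are hypothetical illustrations chosen here; they are not VisQA's own metric or code.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries: Matrix, keys: Matrix) -> Matrix:
    """Scaled dot-product attention weights softmax(Q K^T / sqrt(d)).

    Row i is the attention distribution of query token i over all key tokens.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    return [
        softmax([scale * sum(qj * kj for qj, kj in zip(q, k)) for k in keys])
        for q in queries
    ]


def row_entropy(weights: List[float]) -> float:
    # Hypothetical summary: low entropy means attention is concentrated on
    # few tokens; high entropy means it is spread out.
    return -sum(w * math.log(w) for w in weights if w > 0.0)


if __name__ == "__main__":
    # Hypothetical toy input: 2 word tokens attending over 3 image-region tokens.
    Q = [[1.0, 0.0], [0.0, 0.0]]
    K = [[4.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    for row in attention_map(Q, K):
        print([round(w, 3) for w in row], round(row_entropy(row), 3))
```

In this toy case the first word token concentrates its attention on the first image region, while the all-zero second query spreads its attention uniformly. Contrasts of this kind are what a visualization of attention distributions makes visible.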
