
How Does Attention Work in Vision Transformers? A Visual Analytics Attempt.

Publication Information

IEEE Trans Vis Comput Graph. 2023 Jun;29(6):2888-2900. doi: 10.1109/TVCG.2023.3261935. Epub 2023 May 3.

Abstract

Vision transformer (ViT) extends the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to the interpretation of ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which one is more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify which heads are more important in ViTs by introducing multiple pruning-based metrics. Then, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. Examining the attention strengths and patterns of the important heads, we answer why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution, which deepens the understanding of ViTs in terms of head importance, head attention strength, and head attention pattern.
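The patch-to-patch attention the abstract describes can be sketched in a few lines. This is a hypothetical minimal illustration, not the authors' models: it assumes a 32x32 single-channel image, 8x8 patches, and one attention head with randomly initialized query/key projections, just to show the shape of the attention matrix that the paper's visual analytics operate on.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W) image into flattened p x p patches, forming a sequence."""
    h, w = img.shape
    return img.reshape(h // p, p, w // p, p).swapaxes(1, 2).reshape(-1, p * p)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(tokens, w_q, w_k, d_k):
    """One head's patch-to-patch attention: softmax(Q K^T / sqrt(d_k))."""
    q = tokens @ w_q
    k = tokens @ w_k
    return softmax(q @ k.T / np.sqrt(d_k))

# Toy image and a single randomly initialized head (illustrative only).
img = rng.standard_normal((32, 32))
tokens = patchify(img, 8)                 # 16 patches, each 64 pixels
d_model, d_k = tokens.shape[1], 16
w_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
w_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)

attn = attention_matrix(tokens, w_q, w_k, d_k)
print(attn.shape)                          # (16, 16): every patch pair
print(bool(np.allclose(attn.sum(axis=1), 1.0)))  # each row is a distribution
```

Row `i` of `attn` gives how strongly patch `i` attends to every other patch; inspecting how that mass concentrates on spatial neighbors, per head and per layer, is exactly the kind of profiling the paper automates with its visual analytics tool.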


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ecd/10290521/c8681fa395a0/nihms-1897774-f0001.jpg

Similar Articles

4. Towards Transferable Adversarial Attacks on Image and Video Transformers.
   IEEE Trans Image Process. 2023;32:6346-6358. doi: 10.1109/TIP.2023.3331582. Epub 2023 Nov 20.
6. Token Selection is a Simple Booster for Vision Transformers.
   IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12738-12746. doi: 10.1109/TPAMI.2022.3208922. Epub 2023 Oct 3.
8. Pruning Self-Attentions Into Convolutional Layers in Single Path.
   IEEE Trans Pattern Anal Mach Intell. 2024 May;46(5):3910-3922. doi: 10.1109/TPAMI.2024.3355890. Epub 2024 Apr 3.

References Cited in This Article

1. A Visual Analytics System for Improving Attention-based Traffic Forecasting Models.
   IEEE Trans Vis Comput Graph. 2023 Jan;29(1):1102-1112. doi: 10.1109/TVCG.2022.3209462. Epub 2022 Dec 16.
3. VisQA: X-raying Vision and Language Reasoning in Transformers.
   IEEE Trans Vis Comput Graph. 2022 Jan;28(1):976-986. doi: 10.1109/TVCG.2021.3114683. Epub 2021 Dec 24.
4. Attention Flows: Analyzing and Comparing Attention Mechanisms in Language Models.
   IEEE Trans Vis Comput Graph. 2021 Feb;27(2):1160-1170. doi: 10.1109/TVCG.2020.3028976. Epub 2021 Jan 28.
8. Visual Analytics for Explainable Deep Learning.
   IEEE Comput Graph Appl. 2018 Jul/Aug;38(4):84-92. doi: 10.1109/MCG.2018.042731661.
