IEEE Trans Vis Comput Graph. 2023 Jun;29(6):2888-2900. doi: 10.1109/TVCG.2023.3261935. Epub 2023 May 3.
The Vision Transformer (ViT) extends the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to interpreting ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which ones are more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify which heads are more important in ViTs by introducing multiple pruning-based metrics. Then, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. By examining the attention strengths and patterns of the important heads, we explain why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution, which deepens the understanding of ViTs in terms of head importance, head attention strength, and head attention pattern.
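As a rough illustration of the attention-strength profiling described above, the sketch below (not the paper's implementation; the random projections, shapes, and helper names are illustrative assumptions) computes per-head attention weights for a sequence of ViT patches and measures how much attention each patch sends to its eight spatial neighbors, i.e., a simple per-head locality statistic.

```python
# Minimal sketch (not the paper's code): a toy multi-head self-attention over a
# 14x14 patch grid, plus a per-head statistic for how much attention each patch
# assigns to its 8-connected spatial neighbors. All names/shapes are illustrative.
import torch
import torch.nn.functional as F

def multi_head_attention(x, num_heads):
    """Return attention weights of shape (heads, N, N) for a patch sequence x of shape (N, D)."""
    N, D = x.shape
    d_head = D // num_heads
    # Random projections stand in for the learned W_Q, W_K of a trained ViT layer.
    wq = torch.randn(num_heads, D, d_head)
    wk = torch.randn(num_heads, D, d_head)
    q = torch.einsum('nd,hde->hne', x, wq)           # (heads, N, d_head)
    k = torch.einsum('nd,hde->hne', x, wk)
    attn = torch.einsum('hne,hme->hnm', q, k) / d_head ** 0.5
    return F.softmax(attn, dim=-1)                    # each query row sums to 1

def neighbor_attention_strength(attn, grid=14):
    """Average attention mass each patch sends to its 8 spatial neighbors, per head."""
    H, N, _ = attn.shape
    assert N == grid * grid
    rows = torch.arange(N) // grid
    cols = torch.arange(N) % grid
    dr = (rows[:, None] - rows[None, :]).abs()
    dc = (cols[:, None] - cols[None, :]).abs()
    mask = (dr <= 1) & (dc <= 1) & ~torch.eye(N, dtype=torch.bool)  # 8-neighborhood
    return (attn * mask).sum(-1).mean(-1)             # (heads,): higher = more local head

patches = torch.randn(14 * 14, 768)                   # one image's patch embeddings
attn = multi_head_attention(patches, num_heads=12)
print(neighbor_attention_strength(attn))
```

In practice the attention weights would come from a trained ViT rather than random projections; the neighbor statistic is one way to quantify how "local" each head is.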
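For the pruning-based head-importance step, a minimal sketch of one possible proxy is shown below. It assumes a hypothetical `model(images, head_mask=...)` interface that zeroes out individual heads; the paper introduces multiple such metrics, and this shows only the simplest accuracy-drop variant.

```python
# Sketch of a pruning-based head-importance proxy (illustrative, not the paper's
# exact metric): zero out one head at a time and record the drop in validation
# accuracy. `model(images, head_mask=...)` is a hypothetical interface standing in
# for any ViT implementation that exposes a per-head mask.
import torch

@torch.no_grad()
def head_importance(model, loader, num_layers, num_heads, device='cpu'):
    def accuracy(head_mask):
        correct = total = 0
        for images, labels in loader:
            logits = model(images.to(device), head_mask=head_mask.to(device))
            correct += (logits.argmax(-1).cpu() == labels).sum().item()
            total += labels.numel()
        return correct / total

    full_mask = torch.ones(num_layers, num_heads)
    baseline = accuracy(full_mask)
    importance = torch.zeros(num_layers, num_heads)
    for l in range(num_layers):
        for h in range(num_heads):
            mask = full_mask.clone()
            mask[l, h] = 0.0                                  # prune a single head
            importance[l, h] = baseline - accuracy(mask)      # larger drop = more important head
    return importance
```

Heads whose removal causes the largest score drop are the candidates the visual analysis would then inspect for their attention strengths and patterns.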