Rao Yongming, Liu Zuyan, Zhao Wenliang, Zhou Jie, Lu Jiwen
IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10883-10897. doi: 10.1109/TPAMI.2023.3263826. Epub 2023 Aug 7.
In this paper, we present a new approach for model acceleration by exploiting spatial sparsity in visual data. We observe that the final prediction in vision Transformers is based on only a subset of the most informative regions, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input to accelerate vision Transformers. Specifically, we devise a lightweight prediction module to estimate the importance of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. While the framework is inspired by our observation of the sparse attention in vision Transformers, we find that the idea of adaptive and asymmetric computation can be a general solution for accelerating various architectures. We extend our method to hierarchical models, including CNNs and hierarchical vision Transformers, as well as to more complex dense prediction tasks. To handle structured feature maps, we formulate a generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations. By applying lightweight fast paths to less informative features and expressive slow paths to important locations, we can maintain the complete structure of feature maps while significantly reducing the overall computation. Extensive experiments on diverse modern architectures and different visual tasks demonstrate the effectiveness of our proposed framework. By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-35% and improves throughput by over 40%, while the accuracy drop is within 0.5% for various vision Transformers. By introducing asymmetric computation, a similar acceleration can be achieved on modern CNNs and Swin Transformers. Moreover, our method achieves promising results on more complex tasks, including semantic segmentation and object detection. Our results clearly demonstrate that dynamic spatial sparsification offers a new and more effective dimension for model acceleration. Code is available at https://github.com/raoyongming/DynamicViT.
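The abstract describes a lightweight prediction module that scores token importance and progressively drops the least informative tokens at several depths. The following is a minimal, illustrative PyTorch sketch of that idea, not the authors' implementation (see the linked repository for that); the module names, hidden sizes, and the hard top-k selection used here are assumptions, and the differentiable training-time relaxation described in the paper is omitted.

```python
# Illustrative sketch of dynamic token sparsification (inference path only).
# Names and sizes are assumptions; the official code is in the DynamicViT repo.
import torch
import torch.nn as nn


class TokenScorer(nn.Module):
    """Lightweight prediction module that estimates per-token importance."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -> scores: (batch, num_tokens)
        return self.mlp(tokens).squeeze(-1)


def prune_tokens(tokens: torch.Tensor, scorer: TokenScorer, keep_ratio: float):
    """Keep only the highest-scoring tokens (hard top-k at inference).

    In practice the class token is excluded from pruning, and training uses a
    differentiable relaxation of this selection; both are omitted here.
    """
    scores = scorer(tokens)                               # (B, N)
    num_keep = max(1, int(tokens.shape[1] * keep_ratio))
    keep_idx = scores.topk(num_keep, dim=1).indices       # (B, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return torch.gather(tokens, 1, keep_idx)              # (B, num_keep, D)


# Progressive sparsification: applying the scorer at, e.g., three depths with a
# keep ratio of 0.7 each time leaves roughly 0.7**3 ~= 34% of the input tokens,
# matching the ~66% pruning rate reported in the abstract.
```

Inserting such a pruning step after several Transformer blocks yields the hierarchical, input-dependent sparsification the abstract refers to; the later blocks then operate on progressively fewer tokens, which is where the FLOP and throughput savings come from.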