Expediting Large-Scale Vision Transformer for Dense Prediction Without Fine-Tuning

Authors

Yuan Yuhui, Liang Weicong, Ding Henghui, Liang Zhanhao, Zhang Chao, Hu Han

Publication information

IEEE Trans Pattern Anal Mach Intell. 2024 Jan;46(1):250-266. doi: 10.1109/TPAMI.2023.3327511. Epub 2023 Dec 5.

DOI:10.1109/TPAMI.2023.3327511

Abstract

Large-scale Vision Transformers have achieved state-of-the-art performance on a wide range of dense prediction tasks, but at a high computational cost. In contrast to most existing approaches, which accelerate Vision Transformers for image classification, we focus on accelerating Vision Transformers for dense prediction without any fine-tuning. We present two non-parametric operators specialized for dense prediction tasks: a token clustering layer that decreases the number of tokens to expedite computation, and a token reconstruction layer that increases the number of tokens to recover high-resolution representations. The approach proceeds in three steps: i) a token clustering layer clusters neighboring tokens and yields low-resolution representations that retain spatial structure; ii) the subsequent transformer layers are applied only to these clustered low-resolution tokens; and iii) a token reconstruction layer reconstructs high-resolution representations from the refined low-resolution representations. The proposed approach consistently shows promising results on six dense prediction tasks: object detection, semantic segmentation, panoptic segmentation, instance segmentation, depth estimation, and video instance segmentation. Additionally, we validate the effectiveness of the proposed approach on very recent state-of-the-art open-vocabulary recognition methods. Furthermore, a number of recent representative approaches are benchmarked and compared on dense prediction tasks.
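The three steps in the abstract can be illustrated with a minimal sketch. The abstract does not specify how tokens are clustered or reconstructed, so this example makes loud assumptions: fixed-grid average pooling stands in for the token clustering layer, and copying each cluster feature back to its member tokens stands in for the token reconstruction layer. The function names (`cluster_tokens`, `reconstruct_tokens`, `expedite`) and the `stride` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def cluster_tokens(tokens, h, w, stride):
    """Assumed stand-in for token clustering: average-pool neighboring tokens.

    tokens: array of shape (h * w, c), laid out row by row.
    Returns the clustered tokens, shape ((h // stride) * (w // stride), c),
    and an index array mapping every original token to its cluster.
    """
    n, c = tokens.shape
    assert n == h * w and h % stride == 0 and w % stride == 0
    gh, gw = h // stride, w // stride
    # Split the grid into stride x stride blocks and average each block.
    blocks = tokens.reshape(gh, stride, gw, stride, c)
    clustered = blocks.mean(axis=(1, 3)).reshape(gh * gw, c)
    # Record which cluster each high-resolution token belongs to.
    rows, cols = np.divmod(np.arange(n), w)
    assign = (rows // stride) * gw + (cols // stride)
    return clustered, assign


def reconstruct_tokens(refined, assign):
    """Assumed stand-in for token reconstruction: copy cluster features back."""
    return refined[assign]


def expedite(tokens, h, w, stride, layers):
    """Cluster, run the remaining layers on fewer tokens, then reconstruct."""
    clustered, assign = cluster_tokens(tokens, h, w, stride)  # step i
    for layer in layers:                                      # step ii
        clustered = layer(clustered)
    return reconstruct_tokens(clustered, assign)              # step iii


if __name__ == "__main__":
    x = np.random.rand(16 * 16, 8)          # 16 x 16 grid of 8-dim tokens
    layers = [lambda t: t + 1.0] * 3        # placeholders for transformer layers
    y = expedite(x, 16, 16, stride=2, layers=layers)
    print(x.shape, y.shape)                 # both have 256 tokens
```

In this sketch the placeholder layers process 64 tokens instead of 256, and because neither operator has learnable parameters, nothing needs to be retrained. Any accuracy claim would depend on the actual operators, which this sketch does not reproduce.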
