School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, 510275, China.
School of Information Engineering, Guangdong University of Technology, Guangzhou, 510006, China.
Neural Netw. 2024 Dec;180:106686. doi: 10.1016/j.neunet.2024.106686. Epub 2024 Aug 31.
Vision Transformers have achieved impressive performance in image super-resolution. However, they suffer from low inference speed, mainly because of the quadratic complexity of multi-head self-attention (MHSA), which is key to learning long-range dependencies. In contrast, most CNN-based methods neglect the important effect of global contextual information, resulting in inaccurate and blurred details. A method that makes the best of both Transformers and CNNs would achieve a better trade-off between image quality and inference speed. Based on this observation, we first hypothesize that the main factor affecting the performance of Transformer-based SR models is the overall architecture design, not the specific MHSA component. To verify this, we conduct ablation studies in which MHSA is replaced with large-kernel convolutions, alongside other essential module replacements. Surprisingly, the derived models achieve competitive performance. We therefore extract a general architecture design, GlobalSR, which leaves the core modules of Transformer-based SR models (the blocks and domain embeddings) unspecified. It also provides three practical guidelines for designing a lightweight SR network that exploits image-level global contextual information to reconstruct SR images. Following these guidelines, we instantiate the blocks and domain embeddings of GlobalSR with a Deformable Convolution Attention Block (DCAB) and a Fast Fourier Convolution Domain Embedding (FCDE), respectively. The resulting instantiation, termed GlobalSR-DF, proposes Deformable Convolution Attention (DCA) to extract global contextual features at the block level, using deformable convolution and a Hadamard product as the attention map. Meanwhile, the FCDE applies the Fast Fourier Transform to map input spatial features into the frequency domain and then extracts image-level global information from them with convolutions.
Extensive experiments demonstrate that GlobalSR is the key to achieving a superior trade-off between SR quality and efficiency. Specifically, our proposed GlobalSR-DF outperforms state-of-the-art CNN-based and ViT-based SISR models in the accuracy-speed trade-off while producing sharp and natural details.
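The two core ideas of GlobalSR-DF, a Hadamard-product attention map at the block level and frequency-domain channel mixing for image-level global context, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the deformable convolution of DCA is replaced here by a plain 1x1 channel projection for brevity, and FCDE's convolutions are reduced to a single per-frequency channel mixing, which is equivalent to a spatially global operation. All function names and weight shapes are illustrative assumptions.

```python
import numpy as np

def hadamard_attention(x, w_attn):
    """Simplified block-level attention in the spirit of DCA: a learned map
    is applied to the feature via an element-wise (Hadamard) product.
    The paper uses deformable convolution to build the map; a plain 1x1
    projection stands in for it here (an assumption for illustration)."""
    # x: (C, H, W) feature map; w_attn: (C, C) hypothetical 1x1-conv weights
    attn = np.einsum('oc,chw->ohw', w_attn, x)
    attn = 1.0 / (1.0 + np.exp(-attn))   # sigmoid gating
    return x * attn                       # Hadamard product

def fourier_domain_embedding(x, w_freq):
    """Sketch of the FCDE idea: transform spatial features to the frequency
    domain with a 2-D FFT, mix channels there (a 1x1 'convolution' in
    frequency space, which is global in the spatial domain), then transform
    back to the spatial domain."""
    # x: (C, H, W) real feature map; w_freq: (C, C) mixing weights
    freq = np.fft.rfft2(x, axes=(-2, -1))           # (C, H, W//2 + 1), complex
    mixed = np.einsum('oc,chw->ohw', w_freq, freq)  # per-frequency channel mixing
    return np.fft.irfft2(mixed, s=x.shape[-2:], axes=(-2, -1))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))       # toy 8-channel 16x16 feature
w = rng.standard_normal((8, 8)) * 0.1
y = fourier_domain_embedding(hadamard_attention(x, w), w)
print(y.shape)  # (8, 16, 16)
```

Because a pointwise product in the frequency domain corresponds to a convolution over the entire spatial extent, every output position in `y` depends on all input positions, which is the image-level global receptive field the abstract attributes to FCDE.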