Zhang Yan, Xu Fujie, Sun Yemei, Wang Jiao
College of Computer and Information Engineering, Tianjin Chengjian University, Tianjin 300384, China.
Neural Netw. 2025 Jul;187:107351. doi: 10.1016/j.neunet.2025.107351. Epub 2025 Mar 17.
Previous works have indicated that Transformer-based models achieve impressive image reconstruction performance in single image super-resolution (SISR). However, existing Transformer-based approaches compute self-attention within non-overlapping windows. This restriction hinders the network's ability to adopt large receptive fields, which are essential for capturing global information and establishing long-distance dependencies, especially in the early layers. To fully leverage global information and activate more pixels during the image reconstruction process, we have developed a Spatial and Frequency Information Fusion Transformer (SFFT) with an expansive receptive field. SFFT concurrently combines spatial and frequency domain information to exploit their complementary strengths, capturing both local and global image features while integrating low- and high-frequency information. Additionally, we utilize the overlapping cross-attention block (OCAB) to facilitate pixel transmission between adjacent windows, enhancing network performance. During the training stage, we incorporate the Fast Fourier Transform (FFT) loss, thereby fully leveraging the capabilities of our proposed modules and further tapping into the model's potential. Extensive quantitative and qualitative evaluations on benchmark datasets indicate that the proposed algorithm surpasses state-of-the-art methods in terms of accuracy. Specifically, our method achieves a PSNR of 32.67 dB on the Manga109 dataset, surpassing SwinIR by 0.64 dB and HAT by 0.19 dB. The source code and pre-trained models are available at https://github.com/Xufujie/SFFT.
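As a minimal sketch of the FFT loss idea mentioned above: a common formulation penalizes the distance between the 2-D Fourier spectra of the reconstructed and ground-truth images, encouraging the network to recover high-frequency detail. The numpy function below is illustrative (the function name `fft_loss` and the L1 spectral distance are assumptions; the paper's exact formulation may differ).

```python
import numpy as np

def fft_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute difference between the 2-D Fourier spectra
    of a predicted image and its ground truth (a common FFT-loss form)."""
    pred_fft = np.fft.fft2(pred, norm="ortho")      # spectrum of prediction
    target_fft = np.fft.fft2(target, norm="ortho")  # spectrum of ground truth
    return float(np.mean(np.abs(pred_fft - target_fft)))
```

In practice such a term is added to a standard pixel-space loss (e.g., L1) with a small weight, so the spectral penalty sharpens high-frequency content without dominating training.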