IEEE Trans Image Process. 2023;32:3027-3039. doi: 10.1109/TIP.2023.3275538. Epub 2023 May 26.
In recent years, various neural network architectures for computer vision have been devised, such as the vision transformer and the multilayer perceptron (MLP). A transformer based on an attention mechanism can outperform a traditional convolutional neural network. Compared with the convolutional neural network and the transformer, the MLP introduces less inductive bias and achieves stronger generalization. In addition, a transformer shows an exponential increase in inference, training, and debugging times. Building on a wave function representation, we propose the WaveNet architecture, which adopts a novel vision task-oriented wavelet-based MLP for feature extraction to perform salient object detection in RGB (red-green-blue)-thermal infrared images. We also apply knowledge distillation, using a transformer as an advanced teacher network to acquire rich semantic and geometric information that then guides WaveNet learning. Following the shortest-path concept, we adopt the Kullback-Leibler distance as a regularization term so that the RGB features become as similar to the thermal infrared features as possible. The discrete wavelet transform allows for the examination of frequency-domain features in a local time domain and time-domain features in a local frequency domain. We use this representational ability for cross-modality feature fusion. Specifically, we introduce a progressively cascaded sine-cosine module for cross-layer feature fusion and use low-level features to obtain clear boundaries of salient objects through the MLP. Results from extensive experiments indicate that the proposed WaveNet achieves impressive performance on benchmark RGB-thermal infrared datasets. The results and code are publicly available at https://github.com/nowander/WaveNet.
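As a rough illustration of the Kullback-Leibler regularization term mentioned in the abstract, the following minimal sketch computes a KL penalty between an RGB feature vector and a thermal infrared feature vector. It is not taken from the WaveNet repository. The function names, the softmax normalization of the features, and the direction of the divergence (thermal as the target, RGB as the approximation) are all assumptions made for this example.

```python
import math


def softmax(values):
    """Turn raw feature activations into a probability distribution."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # eps guards against log(0); it is an implementation convenience only.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def modality_kl_penalty(rgb_features, thermal_features):
    """Hypothetical regularizer: KL(softmax(thermal) || softmax(rgb)).

    Assumption: the thermal features act as the target distribution, so
    minimizing this value pulls the RGB features toward the thermal ones.
    """
    return kl_divergence(softmax(thermal_features), softmax(rgb_features))


if __name__ == "__main__":
    rgb = [1.0, 2.0, 3.0]      # illustrative values, not real features
    thermal = [3.0, 2.0, 1.0]  # illustrative values, not real features
    print(modality_kl_penalty(rgb, thermal))
    print(modality_kl_penalty(rgb, rgb))
```

The penalty is zero when the two normalized feature vectors coincide and grows as they diverge, which is the behaviour a similarity regularizer needs. In a training loop, a term of this kind would typically be added to the main detection loss with a weighting factor. That weighting is likewise an assumption here, since the abstract does not specify it.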