Yang Shengying, Yang Xinqi, Wu Jianfeng, Feng Boyang
Zhejiang University of Science and Technology, Hangzhou, 310023, China.
Zhejiang Shuren University, Hangzhou, 310023, China.
Sci Rep. 2024 Oct 14;14(1):24051. doi: 10.1038/s41598-024-74654-4.
The technique of locating distinct part regions and extracting their distinguishing features has driven significant progress in fine-grained visual classification (FGVC). Attention mechanisms have become one of the mainstream approaches to feature extraction in computer vision, but they have certain limitations: they typically focus on the most discriminative regions and directly combine those part features, neglecting other less prominent yet still discriminative regions, and they may not fully exploit the intrinsic connections between higher-order and lower-order features to optimize classification performance. By modeling the potential relationships between different higher-order feature representations of an object image, the integrated higher-order features can contribute more strongly to the model's classification decisions. To this end, we propose a saliency feature suppression and cross-feature fusion network (SFSCF-Net) that explores interaction learning between different higher-order feature representations. It comprises (1) an object-level image generator (OIG): the intersection of the output feature maps of the backbone's last two convolutional blocks is used as an object mask and mapped onto the original image for cropping, yielding an object-level image that effectively reduces interference from complex backgrounds; (2) a saliency feature suppression module (SFSM): the most discriminative part of the object image is located by a feature extractor and masked by a two-dimensional suppression method, improving the accuracy of feature suppression; and (3) a cross-feature fusion method (CFM) based on inter-layer interaction: the output feature maps of different network layers are interactively integrated into high-dimensional features, which are then channel-compressed to obtain an inter-layer interaction feature representation, enriching the semantic information of the output features. The proposed SFSCF-Net can be trained end-to-end and achieves state-of-the-art or competitive results on four FGVC benchmark datasets.
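The OIG step described in (1) can be illustrated with a minimal NumPy sketch. The binarization threshold, channel-averaging of feature maps, and nearest-neighbour upsampling below are illustrative assumptions, not the paper's exact procedure; only the core idea (intersecting two activation masks and cropping the image to the intersection's bounding box) follows the abstract.

```python
import numpy as np

def object_level_crop(img, fmap_a, fmap_b, thresh=0.5):
    """Sketch of an OIG-style crop: intersect two binarized activation
    maps from the last two convolutional blocks, upsample the
    intersection to image resolution, and crop the image to its
    bounding box. `thresh` and the upsampling scheme are assumptions."""
    H, W = img.shape[:2]
    # Channel-average each feature map (C, h, w) to a 2-D activation map.
    act_a = fmap_a.mean(axis=0)
    act_b = fmap_b.mean(axis=0)
    # Binarize each map relative to its own peak activation (assumption).
    mask_a = act_a > thresh * act_a.max()
    mask_b = act_b > thresh * act_b.max()
    mask = mask_a & mask_b  # intersection serves as the object mask
    # Nearest-neighbour upsampling of the coarse mask to image size.
    sy, sx = H // mask.shape[0], W // mask.shape[1]
    mask_img = np.kron(mask.astype(np.uint8),
                       np.ones((sy, sx), dtype=np.uint8)).astype(bool)
    ys, xs = np.nonzero(mask_img)
    if ys.size == 0:
        return img  # no overlap: fall back to the full image
    return img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```

Cropping to the mask's bounding box (rather than zeroing the background) keeps the object-level image dense, which matches the abstract's goal of reducing interference from complex backgrounds.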
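The SFSM step in (2) can likewise be sketched. Using the channel-averaged activation peak as the "most discriminative part" and a fixed square suppression window are assumptions made for illustration; the paper's feature extractor and two-dimensional suppression method may differ in detail.

```python
import numpy as np

def suppress_salient_region(fmap, window=3):
    """Sketch of SFSM-style 2-D suppression: locate the spatial peak of
    the channel-averaged activation (a saliency surrogate, assumed
    here) and zero a window around it across all channels, pushing the
    model toward secondary discriminative regions."""
    act = fmap.mean(axis=0)  # (H, W) saliency surrogate
    peak = np.unravel_index(np.argmax(act), act.shape)
    half = window // 2
    y0, y1 = max(peak[0] - half, 0), peak[0] + half + 1
    x0, x1 = max(peak[1] - half, 0), peak[1] + half + 1
    out = fmap.copy()
    out[:, y0:y1, x0:x1] = 0.0  # mask the most salient part in 2-D
    return out
```

Masking a 2-D spatial window, rather than suppressing whole channels, is what lets less prominent but still discriminative regions at other locations survive intact.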
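The CFM step in (3) can be sketched as a pairwise channel interaction followed by channel compression. The elementwise product interaction, the assumption that both maps share the same spatial size (e.g. after upsampling), and the 1x1-projection weights `w_compress` are all illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def cross_feature_fusion(fmap_lo, fmap_hi, w_compress):
    """Sketch of CFM-style inter-layer interaction: pairwise channel
    products between a lower-layer map (C1, H, W) and a higher-layer
    map (C2, H, W) form a high-dimensional feature (C1*C2, H, W),
    which a 1x1-convolution-style projection `w_compress` of shape
    (C_out, C1*C2) compresses back down. Both maps are assumed to be
    spatially aligned beforehand."""
    C1, H, W = fmap_lo.shape
    C2 = fmap_hi.shape[0]
    # Interactive integration: every (lo, hi) channel pair is multiplied.
    inter = (fmap_lo[:, None] * fmap_hi[None, :]).reshape(C1 * C2, H, W)
    # Channel compression: 1x1 projection over the interaction channels.
    fused = np.tensordot(w_compress, inter, axes=([1], [0]))  # (C_out, H, W)
    return fused
```

The intermediate `C1*C2`-channel tensor is why compression is needed: the pairwise interaction enriches semantics but would otherwise blow up the channel count quadratically.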