Chang Dongliang, Ding Yifeng, Xie Jiyang, Bhunia Ayan Kumar, Li Xiaoxu, Ma Zhanyu, Wu Ming, Guo Jun, Song Yi-Zhe
IEEE Trans Image Process. 2020 Feb 20. doi: 10.1109/TIP.2020.2973812.
The key to solving fine-grained image categorization is finding discriminative local regions that correspond to subtle visual traits. Great strides have been made, with complex networks designed specifically to learn part-level discriminative feature representations. In this paper, we show that it is possible to cultivate subtle details without overly complicated network designs or training mechanisms - a single loss is all it takes. The main trick lies in how we delve into individual feature channels early on, as opposed to the convention of starting from a consolidated feature map. The proposed loss function, termed mutual-channel loss (MC-Loss), consists of two channel-specific components: a discriminality component and a diversity component. The discriminality component forces all feature channels belonging to the same class to be discriminative, through a novel channel-wise attention mechanism. The diversity component additionally constrains channels so that they become mutually exclusive across the spatial dimension. The end result is a set of feature channels, each of which reflects a different locally discriminative region for a specific class. The MC-Loss can be trained end-to-end, without any bounding-box/part annotations, and yields highly discriminative regions during inference. Experimental results show that our MC-Loss, when implemented on top of common base networks, achieves state-of-the-art performance on all four fine-grained categorization datasets (CUB-Birds, FGVC-Aircraft, Flowers-102, and Stanford Cars). Ablative studies further demonstrate the superiority of the MC-Loss over other recently proposed general-purpose losses for visual classification, on two different base networks.
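To make the two components concrete, here is a minimal PyTorch sketch of a mutual-channel-style loss, assuming the backbone's final feature map has num_classes * xi channels (xi channels per class). The function name mc_loss, the parameters xi and cwa_drop, and the random channel-masking form of the channel-wise attention are illustrative assumptions for this sketch, not the authors' released implementation.

```python
# Minimal sketch of a mutual-channel-style loss (illustrative, not the
# paper's official code). Assumes feat has num_classes * xi channels.
import torch
import torch.nn.functional as F

def mc_loss(feat, labels, num_classes, xi, cwa_drop=0.3):
    """feat: (B, num_classes*xi, H, W); labels: (B,) class indices."""
    B, N, H, W = feat.shape
    assert N == num_classes * xi

    # --- Discriminality component ---
    # Channel-wise attention, sketched here as randomly masking a fraction
    # of the xi channels in each class group during training.
    mask = (torch.rand(B, num_classes, xi, device=feat.device) > cwa_drop).float()
    g = feat.view(B, num_classes, xi, H, W) * mask[..., None, None]
    # Cross-channel max pooling within each class group, then global
    # average pooling, so every surviving channel must carry class evidence.
    ccmp = g.max(dim=2).values              # (B, num_classes, H, W)
    logits = ccmp.mean(dim=(2, 3))          # (B, num_classes)
    l_dis = F.cross_entropy(logits, labels)

    # --- Diversity component ---
    # Softmax over spatial positions per channel; the summed spatial max
    # across a group's xi channels grows as the channels attend to
    # mutually exclusive locations.
    s = F.softmax(feat.view(B, N, -1), dim=2).view(B, num_classes, xi, H * W)
    l_div = s.max(dim=2).values.sum(dim=2).mean()   # averaged over classes and batch

    # Diversity is maximized, so it enters with a negative sign.
    return l_dis - l_div
```

In use, such a term would be weighted against the standard cross-entropy of the base network, e.g. total = ce + mu * mc_loss(feat, labels, num_classes, xi), with mu a tunable coefficient.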