Wei Yunchao, Xia Wei, Lin Min, Huang Junshi, Ni Bingbing, Dong Jian, Zhao Yao, Yan Shuicheng
IEEE Trans Pattern Anal Mach Intell. 2016 Sep 1;38(9):1901-1907. doi: 10.1109/TPAMI.2015.2491929. Epub 2015 Oct 26.
Convolutional Neural Networks (CNNs) have demonstrated promising performance on single-label image classification tasks. However, how CNNs can best cope with multi-label images remains an open problem, mainly due to complex underlying object layouts and the scarcity of multi-label training images. In this work, we propose a flexible deep CNN infrastructure, called Hypotheses-CNN-Pooling (HCP), in which an arbitrary number of object segment hypotheses are taken as inputs, a shared CNN is applied to each hypothesis, and the CNN outputs from the different hypotheses are aggregated with max pooling to produce the final multi-label predictions. Some unique characteristics of this flexible deep CNN infrastructure include: 1) no ground-truth bounding box information is required for training; 2) the whole HCP infrastructure is robust to possibly noisy and/or redundant hypotheses; 3) the shared CNN is flexible and can be well pre-trained on a large-scale single-label image dataset, e.g., ImageNet; and 4) it naturally outputs multi-label prediction results. Experimental results on the Pascal VOC 2007 and VOC 2012 multi-label image datasets demonstrate the superiority of the proposed HCP infrastructure over other state-of-the-art methods. In particular, the mAP reaches 90.5% with HCP alone and 93.2% after fusion with our complementary result in [44] based on hand-crafted features on the VOC 2012 dataset.
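The max-pooling aggregation step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the per-hypothesis label scores that the shared CNN would produce are replaced by a toy array, and the score values and label count are invented for the example.

```python
import numpy as np

def hcp_max_pool(hypothesis_scores):
    """Aggregate per-hypothesis label scores with cross-hypothesis max pooling.

    hypothesis_scores: array of shape (num_hypotheses, num_labels), where each
    row is the shared CNN's score vector for one object segment hypothesis.
    Returns a (num_labels,) vector: for each label, the best score any
    hypothesis achieved, which serves as the multi-label prediction.
    """
    scores = np.asarray(hypothesis_scores, dtype=float)
    return scores.max(axis=0)

# Toy example: 3 segment hypotheses, 4 candidate labels.
# A noisy or redundant hypothesis (e.g. the last row) cannot lower a label's
# final score, which is why HCP is robust to such hypotheses.
scores = np.array([[0.9, 0.1, 0.2, 0.0],
                   [0.2, 0.8, 0.1, 0.0],
                   [0.1, 0.2, 0.1, 0.1]])
print(hcp_max_pool(scores))  # [0.9 0.8 0.2 0.1]
```

Each label's final score is driven by the single hypothesis that responds to it most strongly, so one clean hypothesis per object is enough for a correct multi-label prediction.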