IEEE Trans Pattern Anal Mach Intell. 2015 Sep;37(9):1904-16. doi: 10.1109/TPAMI.2015.2389824.
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224 × 224) input image. This requirement is "artificial" and may reduce the recognition accuracy for images or sub-images of arbitrary size or scale. In this work, we equip the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size or scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102× faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvements made for this competition.
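The fixed-length property described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the pyramid levels (1, 2, 4), the floor/ceil bin boundaries, the nested-list feature-map layout, and the names `spatial_pyramid_pool`, `pool_region`, and `make_map` are all assumptions chosen for illustration. Each level n divides the feature map into n × n bins and max-pools each bin, so the output length depends only on the channel count and the levels, not on the spatial size.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature map (nested lists) into a fixed-length list.

    Output length is C * sum(n * n for n in levels), whatever H and W are.
    The bin boundaries (floor for the start, ceil for the end) are an assumption.
    """
    pooled = []
    for n in levels:
        for channel in feature_map:
            height, width = len(channel), len(channel[0])
            for i in range(n):
                r0 = (i * height) // n                 # floor(i * H / n)
                r1 = ((i + 1) * height + n - 1) // n   # ceil((i + 1) * H / n)
                for j in range(n):
                    c0 = (j * width) // n
                    c1 = ((j + 1) * width + n - 1) // n
                    pooled.append(max(channel[r][c]
                                      for r in range(r0, r1)
                                      for c in range(c0, c1)))
    return pooled


def pool_region(feature_map, top, left, bottom, right, levels=(1, 2, 4)):
    """Apply the same pooling to a window given in feature-map coordinates.

    Hypothetical helper: it mimics pooling an arbitrary region from feature
    maps that were computed once for the whole image.
    """
    crop = [[row[left:right] for row in channel[top:bottom]]
            for channel in feature_map]
    return spatial_pyramid_pool(crop, levels)


def make_map(channels, height, width):
    """Build a toy feature map with distinct values (illustration only)."""
    return [[[ch * 1000 + r * width + c for c in range(width)]
             for r in range(height)]
            for ch in range(channels)]


small = make_map(2, 5, 7)
large = make_map(2, 13, 9)
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))  # 42 42
print(len(pool_region(large, 1, 2, 9, 8)))  # 42
```

Both toy maps and the cropped window yield 2 × (1 + 4 + 16) = 42 values, which is what lets a fully connected layer of fixed input size follow convolutional layers that accept inputs of any size. A real network would apply this to the last convolutional layer's output using tensor operations instead of nested lists.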