IEEE Trans Pattern Anal Mach Intell. 2019 Jul;41(7):1655-1668. doi: 10.1109/TPAMI.2018.2846566. Epub 2018 Jun 12.
Image descriptors based on activations of Convolutional Neural Networks (CNNs) have become dominant in image retrieval due to their discriminative power, compactness of representation, and search efficiency. Training CNNs, either from scratch or by fine-tuning, requires a large amount of annotated data, and the quality of the annotation is often crucial. In this work, we propose to fine-tune CNNs for image retrieval on a large collection of unordered images in a fully automated manner. Reconstructed 3D models obtained by state-of-the-art retrieval and structure-from-motion methods guide the selection of the training data. We show that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval. Whitening of the CNN descriptors, learned discriminatively from the same training data, outperforms the commonly used PCA whitening. We propose a novel trainable Generalized-Mean (GeM) pooling layer that generalizes max and average pooling, and show that it boosts retrieval performance. Applying the proposed method to the VGG network achieves state-of-the-art performance on the standard Oxford Buildings, Paris, and Holidays benchmarks.
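The GeM pooling layer mentioned above can be illustrated with a minimal NumPy sketch. It assumes a single non-negative feature map of shape (K, H, W), such as the output of the last ReLU-activated convolutional layer, and computes, for each channel k, the power mean (mean of x^p over spatial positions)^(1/p). The function names `gem_pool` and `l2_normalize` and the clamping constant `eps` are illustrative assumptions, not the authors' code; in the paper p is a trainable parameter learned by backpropagation, whereas here it is simply passed in as a fixed number.

```python
import numpy as np


def gem_pool(feature_map: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized-mean (GeM) pooling over the spatial dimensions (sketch).

    feature_map: non-negative activations of shape (K, H, W).
    Returns a K-dimensional vector whose k-th entry is
    (mean over all H*W positions of x**p) ** (1 / p).
    """
    # Clamp to a small positive value so that x**p and the 1/p root stay well defined.
    x = np.clip(feature_map, eps, None)
    # Average x**p over height and width, then take the p-th root per channel.
    return np.mean(x ** p, axis=(1, 2)) ** (1.0 / p)


def l2_normalize(v: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale a descriptor to unit L2 norm, as is usual before comparing by dot product."""
    return v / (np.linalg.norm(v) + eps)


if __name__ == "__main__":
    # One channel with activations 1, 2, 3, 4 on a 2x2 grid.
    fmap = np.array([[[1.0, 2.0], [3.0, 4.0]]])
    print(gem_pool(fmap, p=1.0))    # p = 1 reduces to average pooling: [2.5]
    print(gem_pool(fmap, p=100.0))  # large p approaches max pooling (close to 4)
    print(l2_normalize(gem_pool(np.random.rand(512, 7, 7))).shape)
```

With p = 1 the layer is exactly average pooling, and as p grows the result approaches the channel maximum, which is the sense in which GeM generalizes both. Very large p values can overflow in floating point, so this sketch keeps p moderate.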