IEEE Trans Pattern Anal Mach Intell. 2018 Jun;40(6):1437-1451. doi: 10.1109/TPAMI.2017.2711011. Epub 2017 Jun 1.
We tackle the problem of large-scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following four principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we create a new weakly supervised ranking loss, which enables end-to-end learning of the architecture's parameters from images depicting the same places over time downloaded from Google Street View Time Machine. Third, we develop an efficient training procedure which can be applied on very large-scale weakly labelled tasks. Finally, we show that the proposed architecture and training procedure significantly outperform non-learnt image representations and off-the-shelf CNN descriptors on challenging place recognition and image retrieval benchmarks.
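The abstract names two components without giving formulas: a VLAD-style pooling layer and a weakly supervised ranking loss. The following is a minimal NumPy sketch of what such components can look like, not the authors' implementation. It assumes soft assignment by a softmax over linear scores, per-cluster normalisation followed by a global L2 normalisation, and a hinge loss that compares each negative against the closest of several candidate positives. The function names, parameter names, and the margin value are hypothetical, and the sketch covers only the forward pass (no backpropagation).

```python
import numpy as np


def netvlad_pool(x, centers, w, b, eps=1e-12):
    """Aggregate N local descriptors of dimension D into one K*D vector.

    x:       (N, D) local descriptors, e.g. a flattened conv feature map
    centers: (K, D) cluster centres
    w, b:    (K, D) and (K,) parameters of the soft-assignment scores
    """
    # Soft assignment: softmax over clusters of w_k . x_i + b_k
    logits = x @ w.T + b                           # (N, K)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)

    # Weighted sum of residuals: v[k] = sum_i a[i, k] * (x[i] - centers[k])
    v = a.T @ x - a.sum(axis=0)[:, None] * centers  # (K, D)

    # Normalise each cluster's row, then the flattened vector
    v /= np.linalg.norm(v, axis=1, keepdims=True) + eps
    v = v.ravel()
    return v / (np.linalg.norm(v) + eps)


def weak_ranking_loss(q, positives, negatives, margin=0.1):
    """Hinge ranking loss with weak labels (assumed form).

    q:         (D,) query descriptor
    positives: (P, D) candidate positives; only some truly show the place
    negatives: (M, D) images known to be far from the query
    """
    # Use the closest candidate positive as the working match
    d_pos = np.sum((positives - q) ** 2, axis=1).min()
    d_neg = np.sum((negatives - q) ** 2, axis=1)
    # Penalise every negative that is not at least `margin` farther away
    return np.maximum(0.0, d_pos + margin - d_neg).sum()
```

With N descriptors of dimension D and K clusters, `netvlad_pool` returns a unit-norm vector of length K*D. Taking the minimum over candidate positives is one way to cope with weak labels, since images taken near the same location do not necessarily depict the same scene.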