IEEE Trans Pattern Anal Mach Intell. 2017 Sep;39(9):1783-1796. doi: 10.1109/TPAMI.2016.2613873. Epub 2016 Sep 27.
Visual search and image retrieval underpin numerous applications; however, the task remains challenging, predominantly due to the variability of object appearance and the ever-increasing size of databases, which often exceed billions of images. Prior-art methods rely on the aggregation of local scale-invariant descriptors, such as SIFT, via mechanisms including the Bag of Visual Words (BoW), the Vector of Locally Aggregated Descriptors (VLAD), and Fisher Vectors (FV). However, their performance still falls short of what is required. This paper presents a novel method for deriving a compact and distinctive representation of image content, called the Robust Visual Descriptor with Whitening (RVD-W), which significantly advances the state of the art. In our approach, local descriptors are rank-assigned to multiple clusters. Residual vectors are then computed in each cluster, normalized using a direction-preserving normalization function, and aggregated based on the neighborhood rank. Importantly, the residual vectors are de-correlated and whitened in each cluster before aggregation, leading to a balanced energy distribution across dimensions and significantly improved performance. We also propose a new post-PCA normalization approach that improves the separability between matching and non-matching global descriptors. This normalization benefits not only our RVD-W descriptor but also existing approaches based on FV and VLAD aggregation. Furthermore, we show that the aggregation framework developed using hand-crafted SIFT features also performs exceptionally well with Convolutional Neural Network (CNN) based features. The RVD-W pipeline outperforms state-of-the-art global descriptors on both the Holidays and Oxford datasets. On the large-scale Holidays1M and Oxford1M datasets, the SIFT-based RVD-W representation obtains a mAP of 45.1 and 35.1 percent, respectively, while CNN-based RVD-W achieves a mAP of 63.5 and 44.8 percent, in all cases surpassing the state of the art.
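The abstract outlines a concrete aggregation pipeline: rank assignment of each local descriptor to several clusters, residual computation, direction-preserving normalization, per-cluster whitening, and rank-weighted aggregation. The Python/NumPy sketch below illustrates that sequence of steps only in broad strokes; the function name, the reciprocal rank weights, and the pre-learned per-cluster whitening matrices are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def rvd_w_aggregate(local_desc, centroids, whiten_mats, kn=3):
        """Hypothetical sketch of RVD-W-style aggregation.

        local_desc  : (N, D) local descriptors (e.g., SIFT)
        centroids   : (K, D) cluster centers learned offline
        whiten_mats : (K, D, D) per-cluster whitening matrices (assumed pre-learned)
        kn          : number of nearest clusters each descriptor is rank-assigned to
        """
        K, D = centroids.shape
        agg = np.zeros((K, D))
        # Reciprocal rank weights for the kn nearest clusters (assumed form).
        rank_w = 1.0 / np.arange(1, kn + 1)
        for x in local_desc:
            # Rank-assign the descriptor to its kn nearest centroids.
            dist = np.linalg.norm(centroids - x, axis=1)
            nearest = np.argsort(dist)[:kn]
            for r, k in enumerate(nearest):
                res = x - centroids[k]                  # residual vector
                res /= np.linalg.norm(res) + 1e-12      # direction-preserving (L2) normalization
                res = whiten_mats[k] @ res              # per-cluster de-correlation / whitening
                agg[k] += rank_w[r] * res               # rank-weighted aggregation
        v = agg.ravel()
        return v / (np.linalg.norm(v) + 1e-12)          # global L2 normalization

With, say, K = 128 clusters and 128-dimensional SIFT descriptors, this yields a 16,384-dimensional global vector, which would then be PCA-reduced and re-normalized per the post-PCA normalization step described above.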