Business Analytics and Mathematical Sciences Department, IBM T.J. Watson Research Center, RM 31-229, 1101 Kitchawan Rd, Rte. 134, Yorktown Heights, NY 10598, USA.
IEEE Trans Pattern Anal Mach Intell. 2012 Dec;34(12):2393-406. doi: 10.1109/TPAMI.2012.48.
Hashing-based approximate nearest neighbor (ANN) search in huge databases has become popular due to its computational and memory efficiency. The popular hashing methods, e.g., Locality Sensitive Hashing and Spectral Hashing, construct hash functions based on random or principal projections. The resulting hashes are either not very accurate or are inefficient. Moreover, these methods are designed for a given metric similarity. On the contrary, semantic similarity is usually given in terms of pairwise labels of samples. There exist supervised hashing methods that can handle such semantic similarity, but they are prone to overfitting when labeled data are small or noisy. In this work, we propose a semi-supervised hashing (SSH) framework that minimizes empirical error over the labeled set and an information theoretic regularizer over both labeled and unlabeled sets. Based on this framework, we present three different semi-supervised hashing methods, including orthogonal hashing, nonorthogonal hashing, and sequential hashing. Particularly, the sequential hashing method generates robust codes in which each hash function is designed to correct the errors made by the previous ones. We further show that the sequential learning paradigm can be extended to unsupervised domains where no labeled pairs are available. Extensive experiments on four large datasets (up to 80 million samples) demonstrate the superior performance of the proposed SSH methods over state-of-the-art supervised and unsupervised hashing techniques.
基于哈希的近似最近邻 (ANN) 在大型数据库中的搜索由于其计算和内存效率而变得流行。流行的哈希方法,例如局部敏感哈希和谱哈希,基于随机或主投影构建哈希函数。得到的哈希要么不是很准确,要么效率不高。此外,这些方法是针对给定的度量相似性设计的。相反,语义相似性通常是根据样本的成对标签来表示的。存在一些监督哈希方法可以处理这种语义相似性,但当标记数据较少或有噪声时,它们容易出现过拟合。在这项工作中,我们提出了一种半监督哈希 (SSH) 框架,该框架在标记集上最小化经验误差,在标记集和未标记集上最小化信息论正则化项。基于这个框架,我们提出了三种不同的半监督哈希方法,包括正交哈希、非正交哈希和顺序哈希。特别是,顺序哈希方法生成了鲁棒的代码,其中每个哈希函数都被设计用来纠正前一个哈希函数的错误。我们进一步表明,顺序学习范式可以扩展到没有可用标记对的无监督领域。在四个大型数据集(多达 8000 万个样本)上的广泛实验表明,所提出的 SSH 方法在最先进的监督和无监督哈希技术方面具有优越的性能。