Liu Xianglong, Fu Qiang, Wang Deqing, Bai Xiao, Wu Xinyu, Tao Dacheng
IEEE Trans Neural Netw Learn Syst. 2020 Dec;31(12):5312-5323. doi: 10.1109/TNNLS.2020.2965992. Epub 2020 Nov 30.
Building multiple hash tables serves as a very successful technique for gigantic data indexing, which can simultaneously guarantee both the search accuracy and efficiency. However, most of existing multitable indexing solutions, without informative hash codes and strong table complementarity, largely suffer from the table redundancy. To address the problem, we propose a complementary binary quantization (CBQ) method for jointly learning multiple tables and the corresponding informative hash functions in a centralized way. Based on CBQ, we further design a distributed learning algorithm (D-CBQ) to accelerate the training over the large-scale distributed data set. The proposed (D-)CBQ exploits the power of prototype-based incomplete binary coding to well align the data distributions in the original space and the Hamming space and further utilizes the nature of multi-index search to jointly reduce the quantization loss. (D-)CBQ possesses several attractive properties, including the extensibility for generating long hash codes in the product space and the scalability with linear training time. Extensive experiments on two popular large-scale tasks, including the Euclidean and semantic nearest neighbor search, demonstrate that the proposed (D-)CBQ enjoys efficient computation, informative binary quantization, and strong table complementarity, which together help significantly outperform the state of the arts, with up to 57.76% performance gains relatively.
构建多个哈希表是一种非常成功的用于海量数据索引的技术,它可以同时保证搜索的准确性和效率。然而,现有的大多多表索引解决方案由于缺乏信息丰富的哈希码和强大的表互补性,在很大程度上存在表冗余问题。为了解决这个问题,我们提出了一种互补二进制量化(CBQ)方法,用于以集中方式联合学习多个表和相应的信息丰富的哈希函数。基于CBQ,我们进一步设计了一种分布式学习算法(D-CBQ),以加速在大规模分布式数据集上的训练。所提出的(D-)CBQ利用基于原型的不完全二进制编码的能力,使原始空间和汉明空间中的数据分布良好对齐,并进一步利用多索引搜索的特性来联合减少量化损失。(D-)CBQ具有几个吸引人的特性,包括在乘积空间中生成长哈希码的可扩展性和线性训练时间的可扩展性。在包括欧几里得和语义最近邻搜索在内的两个流行的大规模任务上进行的大量实验表明,所提出的(D-)CBQ具有高效的计算、信息丰富的二进制量化和强大的表互补性,这些共同有助于显著超越现有技术水平,相对性能提升高达57.76%。