Department of Computing Science, University of Alberta, Edmonton, Canada.
IEEE/ACM Trans Comput Biol Bioinform. 2010 Oct-Dec;7(4):669-80. doi: 10.1109/TCBB.2008.99.
Modern biological applications usually involve the similarity comparison between two objects, which is often computationally very expensive, such as whole genome pairwise alignment and protein 3D structure alignment. Nevertheless, being able to quickly identify the closest neighboring objects from very large databases for a newly obtained sequence or structure can provide timely hints to its functions and more. This paper presents a substantial speedup technique for the well-studied k-nearest neighbor (k-nn) search, based on novel concepts of virtual pivots and partial pivots, such that a significant number of the expensive distance computations can be avoided. The new method is able to dynamically locate virtual pivots, according to the query, with increasing pruning ability. Using the same or less amount of database preprocessing effort, the new method outperformed the second best method by using no more than 40 percent distance computations per query, on a database of 10,000 gene sequences, compared to several best known k-nn search methods including M-Tree, OMNI, SA-Tree, and LAESA. We demonstrated the use of this method on two biological sequence data sets, one of which is for HIV-1 viral strain computational genotyping.
现代生物应用通常涉及两个对象之间的相似性比较,这通常在计算上非常昂贵,例如全基因组两两比对和蛋白质 3D 结构比对。然而,对于新获得的序列或结构,能够从非常大的数据库中快速识别出最接近的相邻对象,可以为其功能提供及时的提示,等等。本文提出了一种基于虚拟枢轴和部分枢轴的新概念的 k-最近邻(k-nn)搜索的实质性加速技术,从而可以避免大量昂贵的距离计算。新方法能够根据查询动态定位虚拟枢轴,从而提高修剪能力。在相同或更少的数据库预处理工作量下,与包括 M-Tree、OMNI、SA-Tree 和 LAESA 在内的几种最知名的 k-nn 搜索方法相比,新方法在 10000 个基因序列的数据库中,每个查询的距离计算不超过 40%,而性能优于第二好的方法。我们在两个生物序列数据集上展示了该方法的使用,其中一个数据集用于 HIV-1 病毒株计算基因分型。