Halperin E, Buhler J, Karp R, Krauthgamer R, Westover B
International Computer Science Institute and Computer Science Division, University of California, Berkeley, CA 94720, USA.
Bioinformatics. 2003;19 Suppl 1:i122-9. doi: 10.1093/bioinformatics/btg1016.
Comparing two protein databases is a fundamental task in biosequence annotation. Given two databases, one must find all pairs of proteins that align with high score under a biologically meaningful substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992). Distance-based approaches to this problem map each peptide in the database to a point in a metric space, such that peptides aligning with higher scores are mapped to closer points. Many techniques exist to discover close pairs of points in a metric space efficiently, but the challenge in applying this work to proteomic comparison is to find a distance mapping that accurately encodes all the distinctions among residue pairs made by a proteomic score matrix. Buhler (2002) proposed one such mapping but found that it led to a relatively inefficient algorithm for protein-protein comparison.
This work proposes a new distance mapping for peptides under the BLOSUM matrices that permits more efficient similarity search. We first propose a new distance function on peptides derived from a given score matrix. We then show how to map peptides to bit vectors such that the distance between any two peptides is closely approximated by the Hamming distance (i.e. number of mismatches) between their corresponding bit vectors. We combine these two results with the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an improved distance-based algorithm for proteomic comparison. An initial implementation of the improved algorithm exhibits sensitivity within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up to eight times faster.
比较两个蛋白质数据库是生物序列注释中的一项基本任务。给定两个数据库,必须在具有生物学意义的替换得分矩阵(例如BLOSUM矩阵(亨尼科夫和亨尼科夫,1992年))下找到所有具有高分比对的蛋白质对。基于距离的解决此问题的方法将数据库中的每个肽映射到度量空间中的一个点,使得比对得分更高的肽被映射到更近的点。存在许多技术可以有效地发现度量空间中的近点对,但将这项工作应用于蛋白质组比较的挑战在于找到一种距离映射,该映射能准确编码蛋白质组得分矩阵对残基对所做的所有区分。布勒(2002年)提出了一种这样的映射,但发现它导致了一种相对低效的蛋白质-蛋白质比较算法。
这项工作提出了一种在BLOSUM矩阵下用于肽的新距离映射,允许进行更有效的相似性搜索。我们首先提出一种基于给定得分矩阵的肽的新距离函数。然后我们展示如何将肽映射到位向量,使得任意两个肽之间的距离由它们相应位向量之间的汉明距离(即不匹配数)紧密近似。我们将这两个结果与布勒(2002年)的LSH - ALL - PAIRS - SIM算法相结合,以产生一种改进的基于距离的蛋白质组比较算法。改进算法的初始实现显示出的灵敏度在原始LSH - ALL - PAIRS - SIM的5%以内,同时运行速度快达八倍。