在线性时间内搜索蛋白质三维结构。

Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan.

J Comput Biol. 2010 Mar;17(3):203-19. doi: 10.1089/cmb.2009.0148.

One of the most important issues in the post-genomic molecular biology is the analysis of protein three-dimensional (3-D) structures, and searching over the 3-D structure databases of them is becoming more and more important. The root mean square deviation (RMSD) is the most popular similarity measure for comparing two molecular structures. In this article, we propose new theoretically and practically fast algorithms for the basic problem of finding all the substructures of structures in a structure database of chain molecules (such as proteins), whose RMSDs to the query are within a given constant threshold. The best-known worst-case time complexity for the problem is O(N log m), where N is the database size and m is the query size. The previous best-known expected time complexity for the problem is also O(N log m). We also propose a new breakthrough linear-expected-time algorithm. It is not only a theoretically significant improvement over previous algorithms, but also a practically faster algorithm, according to computational experiments. Our experiments over the whole Protein Data Bank (PDB) database show that our algorithm is 3.6-28 times faster than previously known algorithms, to search for similar substructures whose RMSDs are within 1A to queries of ordinary lengths. We also propose a series of preprocessing algorithms that enable faster queries, though there have been no known indexing algorithm whose query time complexity is better than the above O(N log m) bound. One is an O(N log(2)N)-time and O(N log N)-space preprocessing algorithm with expected query time complexity of O(m + N given complex square root of m). Another is an O(N log N)-time and O(N)-space preprocessing algorithm with expected query time complexity of O(N given complex square root of m + m log (N given m)).(1)

后基因组分子生物学中最重要的问题之一是蛋白质三维（3-D）结构分析，在其3-D结构数据库中进行搜索变得越来越重要。均方根偏差（RMSD）是比较两个分子结构时最常用的相似性度量。在本文中，我们针对在链状分子（如蛋白质）的结构数据库中查找与查询结构的RMSD在给定常数阈值内的所有子结构这一基本问题，提出了理论上和实践上都快速的新算法。该问题最著名的最坏情况时间复杂度为O(N log m)，其中N是数据库大小，m是查询大小。此前该问题最著名的期望时间复杂度也是O(N log m)。我们还提出了一种新的突破性线性期望时间算法。根据计算实验，它不仅在理论上比以前的算法有显著改进，而且在实践中也是更快的算法。我们在整个蛋白质数据库（PDB）上的实验表明，对于搜索RMSD在1埃以内且长度为普通长度的查询的相似子结构，我们的算法比以前已知的算法快3.6到28倍。我们还提出了一系列预处理算法，尽管目前还没有已知的索引算法其查询时间复杂度优于上述O(N log m)界限，但这些算法能实现更快的查询。一种是时间复杂度为O(N log(2)N)且空间复杂度为O(N log N)的预处理算法，期望查询时间复杂度为O(m + N给定m的复平方根)。另一种是时间复杂度为O(N log N)且空间复杂度为O(N)的预处理算法，期望查询时间复杂度为O(N给定m的复平方根 + m log (N给定m))。(1)