Caprara Alberto, Carr Robert, Istrail Sorin, Lancia Giuseppe, Walenz Brian
D.E.I.S., Università di Bologna, Viale Risorgimento, 2 40136 Bologna, Italy.
J Comput Biol. 2004;11(1):27-52. doi: 10.1089/106652704773416876.
Protein structure comparison is a fundamental problem for structural genomics, with applications to drug design, fold prediction, protein clustering, and evolutionary studies. Despite its importance, there are very few rigorous methods and widely accepted similarity measures known for this problem. In this paper we describe the last few years of developments on the study of an emerging measure, the contact map overlap (CMO), for protein structure comparison. A contact map is a list of pairs of residues which lie in three-dimensional proximity in the protein's native fold. Although this measure is in principle computationally hard to optimize, we show how it can in fact be computed with great accuracy for related proteins by integer linear programming techniques. These methods have the advantage of providing certificates of near-optimality by means of upper bounds to the optimal alignment value. We also illustrate effective heuristics, such as local search and genetic algorithms. We were able to obtain for the first time optimal alignments for large similar proteins (about 1,000 residues and 2,000 contacts) and used the CMO measure to cluster proteins in families. The clusters obtained were compared to SCOP classification in order to validate the measure. Extensive computational experiments showed that alignments which are off by at most 10% from the optimal value can be computed in a short time. Further experiments showed how this measure reacts to the choice of the threshold defining a contact and how to choose this threshold in a sensible way.
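As a minimal illustration of the definitions above, the sketch below builds a contact map from residue coordinates and counts the contacts shared under a given order-preserving alignment. It is not the integer linear programming method of the paper. The function names, the use of one representative coordinate per residue, the threshold value, and the exhaustive search (feasible only for very small inputs) are all assumptions made for illustration.

```python
from itertools import combinations
import math


def contact_map(coords, threshold):
    """Return the set of residue pairs (i, j), i < j, whose distance is <= threshold.

    Assumption: each residue is represented by a single 3-D point.
    """
    contacts = set()
    for (i, p), (j, q) in combinations(enumerate(coords), 2):
        if math.dist(p, q) <= threshold:
            contacts.add((i, j))
    return contacts


def overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A that map onto contacts of B.

    `alignment` maps residue indices of A to residue indices of B.
    """
    shared = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            u, v = alignment[i], alignment[j]
            if (min(u, v), max(u, v)) in contacts_b:
                shared += 1
    return shared


def best_overlap_bruteforce(n_a, n_b, contacts_a, contacts_b):
    """Exhaustively search all order-preserving alignments (tiny inputs only)."""
    best = 0
    for k in range(min(n_a, n_b) + 1):
        for subset_a in combinations(range(n_a), k):
            for subset_b in combinations(range(n_b), k):
                # Pairing two increasing index tuples keeps residue order.
                alignment = dict(zip(subset_a, subset_b))
                best = max(best, overlap(contacts_a, contacts_b, alignment))
    return best


# Hypothetical toy coordinates, not real protein data.
coords_a = [(0, 0, 0), (1, 0, 0), (5, 0, 0)]
coords_b = [(0, 0, 0), (4, 0, 0), (4.5, 0, 0), (5, 0, 0)]
map_a = contact_map(coords_a, threshold=1.5)
map_b = contact_map(coords_b, threshold=1.5)
score = overlap(map_a, map_b, {0: 1, 1: 2, 2: 3})
```

The exhaustive search grows combinatorially with protein length, which is why the abstract turns to integer linear programming bounds and heuristics for proteins of realistic size.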