Lesk A M
Department of Haematology, University of Cambridge Clinical School, United Kingdom.
Proteins. 1998 Nov 15;33(3):320-8.
In analysis, comparison and classification of conformations of proteins, a common computational task involves extractions of similar substructures. Structural comparisons are usually based on either of two measures of similarity: the root-mean-square (r.m.s.) deviation upon optimal superposition, or the maximal element of the difference distance matrix. The analysis presented here clarifies the relationships between different measures of structural similarity, and can provide a basis for developing algorithms and software to extract all maximal common well-fitting substructures from proteins. Given atomic coordinates of two proteins, many methods have been described for extracting some substantial (if not provably maximal) common substructure with low r.m.s. deviation. This is a relatively easy task compared with the problem addressed here, i.e., that of finding all common substructures with r.m.s. deviation less than a prespecified threshold. The combinatorial problems associated with similar subset extraction are more tractable if expressed in terms of the maximal element of the difference distance matrix than in terms of the r.m.s. deviation. However, it has been difficult to correlate these alternative measures of structural similarity. The purpose of this article is to make this connection. We first introduce a third measure of structural similarity: the maximum distance between corresponding pairs of points after superposition to minimize this value. This corresponds to fitting in the Chebyshev norm. Properties of Chebyshev superposition are derived. We describe relationships between the r.m.s. and minimax (Chebyshev) deviations upon optimal superposition, and between the Chebyshev deviation and the maximal element of the difference distance matrix. Combining these produces a relationship between the r.m.s. deviation upon optimal superposition and the maximal element of the difference distance matrix. 
Based on these results, we can apply algorithms and software for finding subsets of the difference distance matrix for which all elements are less than a specified bound, either to select only subsets for which the r.m.s. deviation is less than or equal to a specified threshold, or to select subsets that include all subsets for which the r.m.s. deviation is less than or equal to a threshold.
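The following is an illustrative sketch, not code from the article, of the quantities the abstract compares. The function names (`max_difference_distance`, `pointwise_deviations`) and the toy coordinates are hypothetical. The sketch does not perform an optimal superposition: `pointwise_deviations` reports the r.m.s. and maximum distances for the coordinates exactly as supplied. The bounds mentioned in the comments are elementary consequences of the triangle inequality and are not necessarily the relationships derived in the article.

```python
import math
from itertools import combinations


def max_difference_distance(a, b, subset=None):
    """Largest |d_A(i, j) - d_B(i, j)| over all pairs of indices in `subset`.

    `a` and `b` are equal-length lists of corresponding 3D points.
    The value depends only on intramolecular distances, so it is
    unchanged by any rigid motion of either structure.
    """
    idx = range(len(a)) if subset is None else subset
    return max(
        (abs(math.dist(a[i], a[j]) - math.dist(b[i], b[j]))
         for i, j in combinations(idx, 2)),
        default=0.0,
    )


def pointwise_deviations(a, b):
    """Return (r.m.s., maximum) distance between corresponding points.

    No superposition is performed here: the values describe the
    coordinates as given. For any fixed placement, r.m.s. <= maximum,
    and the maximal difference-distance element is at most twice the
    maximum pointwise distance (triangle inequality).
    """
    d = [math.dist(p, q) for p, q in zip(a, b)]
    rms = math.sqrt(sum(x * x for x in d) / len(d))
    return rms, max(d)


if __name__ == "__main__":
    # Toy example (assumed data): structure b differs from a in one point.
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.5)]

    print("max difference distance:", max_difference_distance(a, b))
    print("restricted to points 0-2:", max_difference_distance(a, b, [0, 1, 2]))
    print("r.m.s., max (as given):", pointwise_deviations(a, b))
```

Because the difference distance matrix needs no superposition, a candidate subset can be screened by checking only the entries for its own index pairs, which is what the `subset` argument illustrates.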