Rice D W, Eisenberg D
UCLA-DOE Laboratory of Structural Biology and Molecular Medicine, Molecular Biology Institute, UCLA, Los Angeles, CA 90095-1570, USA.
J Mol Biol. 1997 Apr 11;267(4):1026-38. doi: 10.1006/jmbi.1997.0924.
In protein fold recognition, a probe amino acid sequence is compared to a library of representative folds of known structure to identify a structural homolog. In cases where the probe and its homolog have clear sequence similarity, traditional residue substitution matrices have been used to predict the structural similarity. For cases where the probe is distant in sequence from its homolog, we have developed a (7 x 3 x 2 x 7 x 3) 3D-1D substitution matrix (called H3P2), calculated from a database of 119 structural pairs. Members of each pair share a similar fold but have sequence identity less than 30%. Each probe sequence position is defined by one of seven residue classes and one of three secondary structure classes. Each homologous fold position is defined by one of seven residue classes, one of three secondary structure classes, and one of two burial classes. Thus the matrix is five-dimensional and contains 7 x 3 x 2 x 7 x 3 = 882 elements, or 3D-1D scores. The first step in assigning a probe sequence to its homologous fold is the prediction of the three-state (helix, strand, coil) secondary structure of the probe; here we use the profile-based neural network prediction of secondary structure (PHD) program. Then a dynamic programming algorithm uses the H3P2 matrix to align the probe sequence with structures in a representative fold library. To test the effectiveness of the H3P2 matrix, a challenging, fold-class-diverse, and cross-validated benchmark assessment is used to compare the H3P2 matrix to the GONNET, PAM250, and BLOSUM62 matrices and to a secondary-structure-only substitution matrix. For distantly related sequences, the H3P2 matrix detects more homologous structures at higher reliabilities than do these other substitution matrices, based on sensitivity versus specificity plots (or SENS-SPEC plots). The added efficacy of the H3P2 matrix arises from its information on the statistical preferences for various sequence-structure environment combinations from very distantly related proteins. It introduces the predicted secondary structure information from a sequence into fold recognition in a statistical way that normalizes the inherent correlations between residue type, secondary structure and solvent accessibility.
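The matrix layout and the alignment step described in the abstract can be illustrated with a small sketch. Only the dimension sizes (7 x 3 x 2 x 7 x 3 = 882) come from the abstract. Everything else here is an assumption made for illustration: the index order in `score_index`, the placeholder values in `toy_scores` (these are not the published H3P2 scores), and the choice of a global Needleman-Wunsch alignment with a linear gap penalty, since the abstract says only that "a dynamic programming algorithm" is used.

```python
from itertools import product

# Dimension sizes from the abstract; the ordering of classes is an assumption.
N_RES = 7      # residue classes (probe and fold positions)
N_SS = 3       # secondary structure classes: helix, strand, coil
N_BURIAL = 2   # burial classes (fold positions only)
N_SCORES = N_RES * N_SS * N_BURIAL * N_RES * N_SS  # 882 3D-1D scores


def score_index(fold_res, fold_ss, fold_burial, probe_res, probe_ss):
    """Flatten a five-dimensional index into the range 0..881 (hypothetical layout)."""
    idx = fold_res
    idx = idx * N_SS + fold_ss
    idx = idx * N_BURIAL + fold_burial
    idx = idx * N_RES + probe_res
    idx = idx * N_SS + probe_ss
    return idx


def toy_scores():
    """Placeholder table, NOT the H3P2 values: reward matching classes."""
    table = [0.0] * N_SCORES
    dims = (range(N_RES), range(N_SS), range(N_BURIAL), range(N_RES), range(N_SS))
    for fr, fs, fb, pr, ps in product(*dims):
        residue_term = 1.0 if fr == pr else -0.5
        ss_term = 1.0 if fs == ps else -0.5
        table[score_index(fr, fs, fb, pr, ps)] = residue_term + ss_term
    return table


def global_align_score(probe, fold, table, gap=-1.0):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    probe: list of (residue_class, predicted_ss) tuples
    fold:  list of (residue_class, ss, burial) tuples
    """
    m = len(fold)
    prev = [j * gap for j in range(m + 1)]  # row 0: probe prefix is empty
    for i, (pr, ps) in enumerate(probe, start=1):
        cur = [i * gap] + [0.0] * m
        for j, (fr, fs, fb) in enumerate(fold, start=1):
            match = prev[j - 1] + table[score_index(fr, fs, fb, pr, ps)]
            cur[j] = max(match, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[m]


if __name__ == "__main__":
    table = toy_scores()
    fold = [(0, 0, 0), (1, 1, 1), (2, 2, 0)]
    probe = [(0, 0), (1, 1), (2, 2)]
    print(N_SCORES, global_align_score(probe, fold, table))
```

In a fold recognition setting, a score of this kind would be computed for the probe against every structure in the fold library and the structures ranked by score. How the published method normalizes or ranks scores is not stated in the abstract and is not modeled here.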