King R D, Sternberg M J
Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, London, United Kingdom.
Protein Sci. 1996 Nov;5(11):2298-310. doi: 10.1002/pro.5560051116.
A protein secondary structure prediction method from multiply aligned homologous sequences is presented with an overall per-residue three-state accuracy of 70.1%. There are two aims: to obtain high accuracy by identifying a set of concepts important for prediction and then applying linear statistics; and to provide insight into the folding process. The important concepts in secondary structure prediction are identified as: residue conformational propensities, sequence edge effects, moments of hydrophobicity, position of insertions and deletions in aligned homologous sequences, moments of conservation, auto-correlation, residue ratios, secondary structure feedback effects, and filtering. Explicit use of edge effects, moments of conservation, and auto-correlation is new to this paper. The relative importance of the concepts used in prediction was analyzed by stepwise addition of information and examination of weights in the discrimination function. The simple and explicit structure of the prediction allows the method to be reimplemented easily. The accuracy of a prediction is predictable a priori. This permits evaluation of the utility of the prediction: 10% of the chains predicted were identified correctly as having a mean accuracy of > 80%. Existing high-accuracy prediction methods are "black-box" predictors based on complex nonlinear statistics (e.g., neural networks in PHD: Rost & Sander, 1993a). For medium- to short-length chains (≥ 90 residues and < 170 residues), the prediction method is significantly more accurate (P < 0.01) than the PHD algorithm (probably the most commonly used algorithm). In combination with PHD, an algorithm is formed that is significantly more accurate than either method alone, with an estimated overall three-state accuracy of 72.4%, the highest accuracy reported for any prediction method.
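The per-residue three-state accuracy quoted above is conventionally the percentage of residues whose predicted state matches the observed one. A minimal sketch of that calculation follows, assuming the common H (helix), E (strand), C (coil) single-letter coding. The function name `q3_accuracy` and the example strings are illustrative only and are not taken from the paper.

```python
# Three-state alphabet assumed here: H = helix, E = strand, C = coil.
STATES = {"H", "E", "C"}


def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("cannot score an empty chain")
    unknown = (set(predicted) | set(observed)) - STATES
    if unknown:
        raise ValueError(f"unexpected state symbols: {sorted(unknown)}")
    # Count positions where prediction and observation agree.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Made-up 20-residue chain, for illustration only.
observed = "CCHHHHHHHHCCEEEEEECC"
predicted = "CCHHHHHHCCCCEEEEEEEC"
print(f"Q3 = {q3_accuracy(predicted, observed):.1f}%")  # 17 of 20 residues agree
```

Scoring each chain separately in this way gives the per-chain accuracies that the abstract refers to when it reports chains with a mean accuracy above 80%.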