Schmidler S C, Liu J S, Brutlag D L
Section on Medical Informatics, Stanford University School of Medicine, CA 94305, USA.
J Comput Biol. 2000 Feb-Apr;7(1-2):233-48. doi: 10.1089/10665270050081496.
We present a novel method for predicting the secondary structure of a protein from its amino acid sequence. Most existing methods predict each position in turn based on a local window of residues, sliding this window along the length of the sequence. In contrast, we develop a probabilistic model of protein sequence/structure relationships in terms of structural segments, and formulate secondary structure prediction as a general Bayesian inference problem. A distinctive feature of our approach is the ability to develop explicit probabilistic models for alpha-helices, beta-strands, and other classes of secondary structure, incorporating experimentally and empirically observed aspects of protein structure such as helical capping signals, side chain correlations, and segment length distributions. Our model is Markovian in the segments, permitting efficient exact calculation of the posterior probability distribution over all possible segmentations of the sequence using dynamic programming. The optimal segmentation is computed and compared to a predictor based on marginal posterior modes, and the latter is shown to provide significant improvement in predictive accuracy. The marginalization procedure provides exact secondary structure probabilities at each sequence position, which are shown to be reliable estimates of prediction uncertainty. We apply this model to a database of 452 nonhomologous structures, achieving accuracies as high as the best currently available methods. We conclude by discussing an extension of this framework to model nonlocal interactions in protein structures, providing a possible direction for future improvements in secondary structure prediction accuracy.
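The segment-level dynamic programming described above can be illustrated with a small forward-backward sketch over all segmentations. This is a toy hidden semi-Markov model, not the authors' model: the emission table, the length distribution, the uniform transitions, the 12-residue maximum segment length, and every function name are assumptions made for the example, and residues within a segment are treated as independent, so capping signals and side chain correlations are left out.

```python
import math

TYPES = ("H", "E", "L")  # helix, strand, loop/other
AMINO = "ACDEFGHIKLMNPQRSTVWY"
NEG_INF = float("-inf")

def logsumexp(values):
    values = list(values)
    m = max(values, default=NEG_INF)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def toy_log_emit():
    # Assumed toy propensities: each class favors four residues 4:1.
    favored = {"H": "AELM", "E": "VIYF", "L": "GPSN"}
    table = {}
    for t in TYPES:
        w = {aa: (4.0 if aa in favored[t] else 1.0) for aa in AMINO}
        z = sum(w.values())
        table[t] = {aa: math.log(v / z) for aa, v in w.items()}
    return table

def toy_log_len(max_len=12):
    # Assumed toy length distributions peaked at 8, 5 and 3 residues.
    peak = {"H": 8, "E": 5, "L": 3}
    table = {}
    for t in TYPES:
        w = [math.exp(-abs(n - peak[t]) / 2.0) for n in range(1, max_len + 1)]
        z = sum(w)
        table[t] = [NEG_INF] + [math.log(v / z) for v in w]  # index 0 unused
    return table

def marginal_posteriors(seq, log_emit, log_len, log_trans, log_init):
    """Exact per-position class posteriors, summed over all segmentations."""
    n = len(seq)
    max_len = len(log_len[TYPES[0]]) - 1
    # Prefix sums give each segment's emission score in O(1).
    cum = {t: [0.0] for t in TYPES}
    for t in TYPES:
        for aa in seq:
            cum[t].append(cum[t][-1] + log_emit[t][aa])

    def seg(t, i, j):  # residues i..j-1 form one segment of type t
        return cum[t][j] - cum[t][i] + log_len[t][j - i]

    # alpha[j][t]: log-probability of seq[:j] with a type-t segment ending at j.
    alpha = [{t: NEG_INF for t in TYPES} for _ in range(n + 1)]

    def inflow(i, t):  # everything before a type-t segment starting at i
        if i == 0:
            return log_init[t]
        return logsumexp(alpha[i][s] + log_trans[s][t] for s in TYPES)

    for j in range(1, n + 1):
        for t in TYPES:
            alpha[j][t] = logsumexp(inflow(j - k, t) + seg(t, j - k, j)
                                    for k in range(1, min(j, max_len) + 1))

    # beta[i][t]: log-probability of seq[i:] given a type-t segment ends at i.
    beta = [{t: 0.0 for t in TYPES} for _ in range(n + 1)]
    for i in range(n - 1, 0, -1):
        for t in TYPES:
            beta[i][t] = logsumexp(
                log_trans[t][s] + seg(s, i, i + k) + beta[i + k][s]
                for s in TYPES for k in range(1, min(n - i, max_len) + 1))

    log_z = logsumexp(alpha[n][t] for t in TYPES)
    post = [{} for _ in range(n)]
    for t in TYPES:
        diff = [0.0] * (n + 1)  # difference array over covered positions
        for i in range(n):
            for j in range(i + 1, min(n, i + max_len) + 1):
                p = math.exp(inflow(i, t) + seg(t, i, j) + beta[j][t] - log_z)
                diff[i] += p
                diff[j] -= p
        running = 0.0
        for pos in range(n):
            running += diff[pos]
            post[pos][t] = running
    return post

# Adjacent segments must differ in type; the other two types are equally likely.
log_trans = {s: {t: (NEG_INF if s == t else math.log(0.5)) for t in TYPES}
             for s in TYPES}
log_init = {t: math.log(1.0 / len(TYPES)) for t in TYPES}

seq = "GSAEELLKMAEELGPNVIVYVFGS"  # made-up sequence for illustration
post = marginal_posteriors(seq, toy_log_emit(), toy_log_len(), log_trans, log_init)
# Marginal posterior mode predictor: most probable class at each position.
prediction = "".join(max(TYPES, key=p.get) for p in post)
print(seq)
print(prediction)
```

Each entry of `post` is a probability distribution over the three classes at one position, which is the quantity the abstract proposes as an estimate of prediction uncertainty. Taking the per-position maximum gives the marginal posterior mode predictor; the single best segmentation would instead come from replacing the sums in the forward pass with maxima and tracing back.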