Stolorz P, Lapedes A, Xia Y
Theoretical Division, Los Alamos National Laboratory, NM 87545.
J Mol Biol. 1992 May 20;225(2):363-77. doi: 10.1016/0022-2836(92)90927-c.
A comparison of neural network methods and Bayesian statistical methods is presented for prediction of the secondary structure of proteins given their primary sequence. The Bayesian method makes the unphysical assumption that the probability of an amino acid occurring in each position in the protein is independent of the amino acids occurring elsewhere. However, we find the predictive accuracy of the Bayesian method to be only minimally less than the accuracy of the most sophisticated methods used to date. We present the relationship of neural network methods to Bayesian statistical methods and show that, in principle, neural methods offer considerable power, although apparently they are not particularly useful for this problem. In the process, we derive a neural formalism in which the output neurons directly represent the conditional probabilities of structure class. The probabilistic formalism allows introduction of a new objective function, the mutual information, which translates the notion of correlation as a measure of predictive accuracy into a useful training measure. Although a similar accuracy to other approaches (utilizing a mean-square error) is achieved using this new measure, the accuracy on the training set is significantly and tantalizingly higher, even though the number of adjustable parameters remains the same. The mutual information measure predicts a greater fraction of helix and sheet structures correctly than the mean-square error measure, at the expense of coil accuracy, precisely as it was designed to do. By combining the two objective functions, we obtain a marginally improved accuracy of 64.4%, with Matthews coefficients C_α, C_β and C_coil of 0.40, 0.32 and 0.42, respectively. However, since all methods to date perform only slightly better than the Bayes algorithm, which entails the drastic assumption of independence of amino acids, one is forced to conclude that little progress has been made on this problem, despite the application of a variety of sophisticated algorithms such as neural networks, and that further advances will require a better understanding of the relevant biophysics.
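The two accuracy measures named in the abstract, the per-class Matthews coefficient and the mutual information between true and predicted structure class, can both be computed from a confusion matrix. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `matthews` and `mutual_information` and the example matrices are hypothetical, the Matthews coefficient is taken in its usual one-vs-rest form, and mutual information is used here only as an evaluation measure rather than as the network training objective described in the paper.

```python
import math


def matthews(confusion, k):
    """One-vs-rest Matthews coefficient for class k.

    confusion[i][j] is the number of residues whose true class is i
    and whose predicted class is j (assumed layout, e.g. helix/sheet/coil).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                          # true k, predicted other
    fp = sum(confusion[i][k] for i in range(n)) - tp     # true other, predicted k
    tn = total - tp - fn - fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


def mutual_information(confusion):
    """Mutual information (in bits) between true and predicted class."""
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    mi = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count > 0:
                # p_ij / (p_i * p_j) simplifies to count * total / (row * col)
                ratio = count * total / (row_sums[i] * col_sums[j])
                mi += (count / total) * math.log2(ratio)
    return mi


# Illustrative matrices only; these are not data from the paper.
perfect = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
uninformative = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
```

On the illustrative `perfect` matrix both measures reach their maximum (a Matthews coefficient of 1 for every class and log2(3) bits of mutual information), while on `uninformative` both are 0. This is the sense in which mutual information, like the Matthews coefficient, rewards correlation between prediction and truth rather than raw percentage correct.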