Department of Computer Science, Bioinformatics Group, Centre for Computational Statistics and Machine Learning, University College London, Malet Place, London WC1E 6BT, UK.
Bioinformatics. 2012 Jan 15;28(2):184-90. doi: 10.1093/bioinformatics/btr638. Epub 2011 Nov 17.
The accurate prediction of residue-residue contacts, critical for maintaining the native fold of a protein, remains an open problem in the field of structural bioinformatics. Interest in this long-standing problem has increased recently with algorithmic improvements and the rapid growth in the sizes of sequence families. Progress could have a major impact on both structure and function prediction, to name but two benefits. Sequence-based contact predictions are usually made by identifying correlated mutations within multiple sequence alignments (MSAs), most commonly through the information-theoretic approach of calculating mutual information between pairs of sites in proteins. These predictions are often inaccurate because the true covariation signal in the MSA is frequently masked by biases from many ancillary indirect-coupling or phylogenetic effects. Here we present a novel method, PSICOV, which introduces the use of sparse inverse covariance estimation to the problem of protein contact prediction. Our method builds on work that had previously demonstrated corrections for phylogenetic and entropic correlation noise, and allows accurate discrimination of directly from indirectly coupled mutation correlations in the MSA.
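To illustrate why inverting a covariance matrix can separate direct from indirect couplings, here is a minimal toy sketch in Python. It is not the PSICOV implementation: the three-site covariance matrix is hypothetical, no sparsity penalty is applied, and the helper `invert` is an illustrative name. The matrix describes a chain A - B - C in which A and C covary only through B.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment with the identity: [A | I] is reduced to [I | A^-1].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Pick the row with the largest absolute entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical covariance for three sites coupled in a chain A - B - C.
# A and C show a non-zero covariance even though they are not directly coupled.
covariance = [
    [0.75, 0.50, 0.25],
    [0.50, 1.00, 0.50],
    [0.25, 0.50, 0.75],
]

precision = invert(covariance)  # inverse covariance (precision) matrix

print("cov(A, C)       =", covariance[0][2])
print("precision(A, B) =", round(precision[0][1], 6))
print("precision(A, C) =", round(precision[0][2], 6))
```

In this toy case the A-C entry of the inverse is zero up to rounding, while the A-B entry is not, so only the direct coupling survives. With real alignments the covariance matrix is estimated from limited, noisy data, which is where a sparsity-constrained estimate such as the one the abstract describes becomes relevant.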
PSICOV displays a mean precision substantially better than the best performing normalized mutual information approach and Bayesian networks. For 118 out of 150 targets, the L/5 (i.e. top-L/5 predictions for a protein of length L) precision for long-range contacts (sequence separation >23) was ≥ 0.5, which represents an improvement sufficient to be of significant benefit in protein structure prediction or model quality assessment.
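The L/5 long-range precision described above can be sketched as follows. This is an illustrative reading of the metric, not the authors' evaluation script: the function name `top_l5_precision`, the use of integer division for L/5, and the toy scores and contacts are all assumptions.

```python
def top_l5_precision(scores, true_contacts, length, min_separation=24):
    """Precision of the top L/5 long-range predictions.

    scores:         dict mapping (i, j) residue pairs (i < j) to a predicted score
    true_contacts:  set of (i, j) pairs that are contacts in the native structure
    length:         protein length L
    min_separation: smallest allowed j - i (24 corresponds to separation > 23)
    """
    # Keep only long-range pairs.
    long_range = [
        (pair, score)
        for pair, score in scores.items()
        if pair[1] - pair[0] >= min_separation
    ]
    # Rank by score, highest first, and keep the top L/5 (assumed integer division).
    ranked = sorted(long_range, key=lambda item: item[1], reverse=True)
    top = ranked[: max(1, length // 5)]
    if not top:
        return 0.0
    hits = sum(1 for pair, _ in top if pair in true_contacts)
    return hits / len(top)


# Hypothetical predictions for a protein of length 30 (top L/5 = 6 pairs).
toy_scores = {
    (0, 24): 0.90, (1, 25): 0.80, (2, 26): 0.70, (3, 27): 0.60,
    (0, 29): 0.50, (1, 28): 0.40, (2, 29): 0.30,
    (5, 10): 0.95, (7, 20): 0.85,  # short-range pairs, excluded from the metric
}
toy_contacts = {(0, 24), (2, 26), (0, 29), (5, 10)}

result = top_l5_precision(toy_scores, toy_contacts, length=30)
print(result)  # 3 of the 6 top long-range pairs are true contacts -> 0.5
```

Note that the highest-scoring pair in the toy data, (5, 10), does not count towards the metric because its sequence separation is below the long-range threshold.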
The PSICOV source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/PSICOV.