Vullo Alessandro, Walsh Ian, Pollastri Gianluca
School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland.
BMC Bioinformatics. 2006 Mar 30;7:180. doi: 10.1186/1471-2105-7-180.
Protein topology representations such as residue contact maps are an important intermediate step towards ab initio prediction of protein structure. Although there have been improvements in recent years, the problem of accurately predicting residue contact maps from primary sequences is still largely unsolved. Among the reasons for this are the unbalanced nature of the problem (with far fewer examples of contacts than non-contacts), the formidable challenge of capturing long-range interactions in the maps, and the intrinsic difficulty of mapping one-dimensional input sequences into two-dimensional output maps. To alleviate these problems and achieve improved contact map predictions, in this paper we split the task into two stages: first, the prediction of a map's principal eigenvector (PE) from the primary sequence; second, the reconstruction of the contact map from the PE and the primary sequence. Predicting the PE from the primary sequence consists in mapping a vector into a vector. This task is less complex than mapping vectors directly into two-dimensional matrices, since the size of the problem is drastically reduced and so is the length scale of the interactions that need to be learned.
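To make the notion of a map's principal eigenvector concrete, the following is a minimal sketch of how a PE could be computed from a known structure. It is not the authors' code. The function names, the 8 Å distance threshold, and the choice to leave the diagonal set to 1 are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def contact_map(coords, threshold=8.0):
    """Binary contact map from an (N, 3) array of residue coordinates.

    The 8 Angstrom threshold is an assumed, commonly used value, not one
    stated in the abstract. The diagonal is 1 because each residue is at
    distance 0 from itself; this shifts eigenvalues by 1 but leaves the
    eigenvectors unchanged.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist < threshold).astype(float)


def principal_eigenvector(cmap):
    """Return (largest eigenvalue, unit-norm eigenvector) of a symmetric map."""
    # eigh is for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cmap)
    pe = eigenvectors[:, -1]
    # An eigenvector is defined up to sign; pick the non-negative orientation.
    if pe.sum() < 0:
        pe = -pe
    return eigenvalues[-1], pe


# Toy example: six residues on a straight line, 3.8 Angstrom apart.
toy_coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(6)])
toy_map = contact_map(toy_coords)
toy_lambda, toy_pe = principal_eigenvector(toy_map)
```

The PE has one component per residue, which is why predicting it from the sequence is a vector-to-vector task rather than a vector-to-matrix one.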
We develop architectures composed of ensembles of two-layered bidirectional recurrent neural networks to classify the components of the PE into 2, 3 and 4 classes from protein primary sequence, predicted secondary structure, and hydrophobicity interaction scales. Our predictor, tested on a non-redundant set of 2171 proteins, achieves classification performances of up to 72.6%, 16% above a baseline statistical predictor. We design a system for the prediction of contact maps from the predicted PE. Our results show that predicting maps through the PE yields sizeable gains, especially for long-range contacts, which are particularly critical for accurate protein 3D reconstruction. The final predictor's accuracy on a non-redundant set of 327 targets is 35.4% and 19.8% for minimum contact separations of 12 and 24, respectively, when the top length/5 contacts are selected. On the 11 CASP6 Novel Fold targets we achieve similar accuracies (36.5% and 19.7%). This compares favourably with the best automated predictors at CASP6.
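The accuracy figures above depend on two choices: a minimum sequence separation and the selection of the top length/5 predictions. A minimal sketch of such a measure follows. It assumes that "separation" means |i - j| >= min_sep and that length/5 is rounded down; the function name and parameters are hypothetical, and the authors' exact evaluation protocol may differ.

```python
import numpy as np


def top_contact_accuracy(scores, true_map, min_sep=12, fraction=5):
    """Fraction of true contacts among the top (length // fraction) predictions.

    scores   : (N, N) array of predicted contact scores (higher = more likely)
    true_map : (N, N) binary array of observed contacts
    min_sep  : only pairs with j - i >= min_sep are considered (assumed convention)
    """
    scores = np.asarray(scores, dtype=float)
    true_map = np.asarray(true_map, dtype=float)
    n = scores.shape[0]
    # Upper-triangle pairs (i, j) with j - i >= min_sep.
    i, j = np.triu_indices(n, k=min_sep)
    if i.size == 0:
        raise ValueError("sequence too short for the requested separation")
    k = min(max(1, n // fraction), i.size)
    # Indices of the k highest-scoring candidate pairs.
    top = np.argsort(scores[i, j])[::-1][:k]
    return float(true_map[i[top], j[top]].mean())


# Toy example: 20 residues, four long-range contacts.
toy_true = np.zeros((20, 20))
for a, b in [(0, 12), (1, 14), (2, 15), (3, 19)]:
    toy_true[a, b] = toy_true[b, a] = 1.0
perfect = top_contact_accuracy(toy_true, toy_true, min_sep=12)
inverted = top_contact_accuracy(1.0 - toy_true, toy_true, min_sep=12)
```

Raising min_sep from 12 to 24 restricts the evaluation to longer-range pairs, which is the regime where the abstract reports the lower accuracies.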
Our final system for contact map prediction achieves state-of-the-art performance and may provide valuable constraints for improved ab initio prediction of protein structures. A suite of predictors of structural features, including the PE and PE-based contact maps, is available at http://distill.ucd.ie.