Walsh Ian, Baù Davide, Martin Alberto J M, Mooney Catherine, Vullo Alessandro, Pollastri Gianluca
School of Computer Science and Informatics, University College Dublin, Dublin, Ireland.
BMC Struct Biol. 2009 Jan 30;9:5. doi: 10.1186/1472-6807-9-5.
Prediction of protein structures from their sequences is still one of the open grand challenges of computational biology. Some approaches to protein structure prediction, especially ab initio ones, rely to some extent on the prediction of residue contact maps. Residue contact map predictions have been assessed at the CASP competition for several years now. Although it has been shown that exact contact maps generally yield correct three-dimensional structures, this is true only at a relatively low resolution (3-4 Å from the native structure). Another known weakness of contact maps is that they are generally predicted ab initio, that is, without exploiting information about potential homologues of known structure.
We introduce a new class of distance restraints for protein structures: multi-class distance maps. We show that Cα trace reconstructions based on 4-class native maps are significantly better than those from residue contact maps. We then build two predictors of 4-class maps based on recursive neural networks: one ab initio, relying only on the sequence and on evolutionary information; and one template-based, in which homology information to known structures is provided as a further input. We show that virtually any level of sequence similarity to structural templates (down to less than 10%) yields more accurate 4-class maps than the ab initio predictor. We show that template-based predictions by recursive neural networks are consistently better than the best template and than a number of combinations of the best available templates. We also extract binary residue contact maps at an 8 Å threshold (as per CASP assessment) from the 4-class predictors and show that the template-based version is also more accurate than the best template and consistently better than the ab initio one, down to very low levels of sequence identity to structural templates. Furthermore, we test both ab initio and template-based 8 Å predictions on the CASP7 targets using a pre-CASP7 PDB, and find that both predictors are state-of-the-art, with the template-based one far outperforming the best CASP7 systems if templates with sequence identity to the query of 10% or better are available. Although this is not the main focus of this paper, we also report on reconstructions of Cα traces based on both ab initio and template-based 4-class map predictions, showing that the latter are generally more accurate even when homology is dubious.
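To make the notion of a multi-class distance map concrete, the following is a minimal sketch of how one could be computed from Cα coordinates and then collapsed into a binary contact map. It is an illustration, not the authors' implementation. Only the 8 Å contact threshold is taken from the text above; the remaining class boundaries (13 and 19 Å), the function names, and the toy coordinates are assumptions made for this example.

```python
import math
from bisect import bisect_right

# Boundaries (in Å) between consecutive distance classes.
# Only 8.0 (the CASP contact threshold) comes from the abstract;
# 13.0 and 19.0 are illustrative assumptions.
CLASS_BOUNDS = (8.0, 13.0, 19.0)


def distance_class(d, bounds=CLASS_BOUNDS):
    """Map a distance to a class index: 0 if d < bounds[0], 1 if
    bounds[0] <= d < bounds[1], and so on; the last class is open-ended."""
    return bisect_right(bounds, d)


def multiclass_map(coords, bounds=CLASS_BOUNDS):
    """Build the symmetric N x N map of distance classes from a list of
    (x, y, z) Cα coordinates."""
    n = len(coords)
    return [
        [distance_class(math.dist(coords[i], coords[j]), bounds) for j in range(n)]
        for i in range(n)
    ]


def binary_contact_map(class_map):
    """Collapse a multi-class map into a binary contact map: residues are
    in contact exactly when they fall in class 0 (closer than 8 Å here)."""
    return [[1 if c == 0 else 0 for c in row] for row in class_map]


# Toy trace (hypothetical): five residues spaced 3.8 Å apart on a line,
# plus one residue placed far away.
trace = [(3.8 * i, 0.0, 0.0) for i in range(5)] + [(40.0, 0.0, 0.0)]
classes = multiclass_map(trace)
contacts = binary_contact_map(classes)
print(classes[0])   # distance classes of residue 0 against all residues
print(contacts[0])  # corresponding binary contacts
```

Because the first class boundary coincides with the contact threshold, the binary map is recoverable from the multi-class map without recomputing distances, while the extra classes retain coarse information about residue pairs that are not in contact.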
Accurate predictions of multi-class maps may provide valuable constraints for improved ab initio and template-based prediction of protein structures, naturally incorporate multiple templates, and yield state-of-the-art binary maps. Predictions of protein structures and 8 Å contact maps based on the multi-class distance map predictors described in this paper are freely available to academic users at the URL http://distill.ucd.ie/.