MOE Key Laboratory of Bioinformatics, School of Life Sciences.
Beijing Innovation Center of Structural Biology.
Bioinformatics. 2017 Sep 1;33(17):2675-2683. doi: 10.1093/bioinformatics/btx296.
Residue-residue contacts are of great value for protein structure prediction, since contact information, especially from long-range residue pairs, can significantly reduce the complexity of conformational sampling in practice. Despite progress in the past decade on protein targets with abundant homologous sequences, accurate contact prediction for proteins with limited sequence information is still far from satisfactory. Methodologies for these hard targets need further improvement.
We present DeepConPred, a computational program that combines a pipeline of two novel deep-learning-based methods (DeepCCon and DeepRCon) with a contact refinement step to improve the prediction of long-range residue contacts from primary sequences. Compared with previous prediction approaches, our framework employs an effective scheme to identify optimal and important features for contact prediction, and it was trained only with coevolutionary information derived from a limited number of homologous sequences, to ensure robustness and usefulness for hard targets. Independent tests showed that 59.33%/49.97%, 64.39%/54.01% and 70.00%/59.81% of the top L/5, top L/10 and top 5 predictions were correct for CASP10/CASP11 proteins, respectively. In general, our algorithm ranked as one of the best methods for CASP targets.
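The top L/5, top L/10 and top 5 figures above are precision values over the highest-scoring long-range predictions, where L is the sequence length. The following is a minimal sketch of how such a value could be computed; it is not taken from the DeepConPred code. The function name `top_k_precision`, the dictionary-based input format and the sequence-separation cutoff of 24 residues for "long-range" pairs are assumptions made for illustration (the cutoff follows a common convention in CASP-style assessments).

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # residue indices (i, j) with i < j


def top_k_precision(
    scores: Dict[Pair, float],
    true_contacts: Set[Pair],
    k: int,
    min_separation: int = 24,  # assumed long-range cutoff, not from the source
) -> float:
    """Fraction of the k highest-scoring long-range pairs that are true contacts."""
    # Keep only pairs whose sequence separation qualifies as long-range.
    long_range = [
        (pair, score)
        for pair, score in scores.items()
        if abs(pair[0] - pair[1]) >= min_separation
    ]
    # Rank by predicted contact score, highest first, and take the top k.
    ranked = sorted(long_range, key=lambda item: item[1], reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair, _ in ranked if pair in true_contacts)
    return hits / len(ranked)


# Toy example with hypothetical scores for a protein of length L = 50.
toy_scores = {
    (0, 30): 0.9,
    (1, 40): 0.8,
    (2, 45): 0.7,
    (3, 5): 0.99,  # short-range pair, ignored by the long-range filter
    (4, 49): 0.6,
    (5, 35): 0.5,
}
toy_true = {(0, 30), (2, 45), (3, 5)}
```

For a protein of length L, the top L/5 precision would then be `top_k_precision(scores, true_contacts, k=max(1, L // 5))`, and the top 5 precision uses `k=5` regardless of L.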
All source data and code are available at http://166.111.152.91/Downloads.html .
hgong@tsinghua.edu.cn or zengjy321@tsinghua.edu.cn.
Supplementary data are available at Bioinformatics online.