Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA.
Department of Mathematics and Computer Science, University of Missouri, St. Louis, MO 63121, USA.
Bioinformatics. 2020 Feb 15;36(4):1091-1098. doi: 10.1093/bioinformatics/btz679.
Deep learning has become the dominant technology for protein contact prediction. However, the factors that affect the performance of deep learning in contact prediction have not been systematically investigated.
We analyzed the results of our three deep learning-based contact prediction methods (MULTICOM-CLUSTER, MULTICOM-CONSTRUCT and MULTICOM-NOVEL) in the CASP13 experiment and identified several key factors [i.e. deep learning technique, multiple sequence alignment (MSA), distance distribution prediction and domain-based contact integration] that influenced contact prediction accuracy. We compared our convolutional neural network (CNN)-based contact prediction methods with three coevolution-based methods on 75 CASP13 targets consisting of 108 domains. We demonstrated that the CNN-based multi-distance approach was able to leverage global coevolutionary coupling patterns composed of multiple correlated contacts to predict contacts more accurately than the local coevolution-based methods, increasing precision by 19.2 percentage points. We also tested different alignment methods and domain-based contact prediction with the deep learning contact predictors. The comparison of the three methods showed that deeper sequence alignments, and the integration of domain-based contact prediction with full-length contact prediction, improved contact prediction performance. Moreover, we demonstrated that domain-based contact prediction using a novel ab initio approach, which parses domains from MSAs alone without using known protein structures, was a simple and fast way to improve contact prediction. Finally, we showed that predicting the distribution of inter-residue distances over multiple distance intervals could capture more structural information and improve binary contact prediction.
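The relationship between a multi-interval distance prediction and a binary contact call, and the way precision is scored on top-ranked pairs, can be illustrated with a minimal sketch. It is not the authors' implementation: the bin edges, the function names `contact_probability` and `top_k_precision`, and the example scores are all hypothetical. It assumes the common CASP convention that two residues are in contact when their distance is below 8 Å, and that long-range pairs are those at least 24 residues apart in sequence.

```python
from typing import Dict, Sequence, Set, Tuple

Pair = Tuple[int, int]


def contact_probability(
    bin_probs: Sequence[float],
    bin_upper_edges: Sequence[float],
    threshold: float = 8.0,  # assumed contact cutoff in angstroms
) -> float:
    """Collapse a predicted distance distribution into one contact probability.

    Sums the probability mass of every bin whose upper edge does not
    exceed the contact threshold.
    """
    return sum(p for p, upper in zip(bin_probs, bin_upper_edges) if upper <= threshold)


def top_k_precision(
    scores: Dict[Pair, float],
    true_contacts: Set[Pair],
    k: int,
    min_separation: int = 24,  # assumed long-range sequence separation
) -> float:
    """Fraction of the k highest-scoring long-range pairs that are true contacts."""
    long_range = [pair for pair in scores if abs(pair[0] - pair[1]) >= min_separation]
    ranked = sorted(long_range, key=lambda pair: scores[pair], reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair in ranked if pair in true_contacts)
    return hits / len(ranked)


# Hypothetical distance bins: <6, 6-8, 8-10, >=10 angstroms.
edges = [6.0, 8.0, 10.0, float("inf")]
probs = [0.25, 0.5, 0.125, 0.125]
p_contact = contact_probability(probs, edges)

# Hypothetical predicted scores and native contacts for a toy target.
scores = {(1, 30): 0.9, (2, 40): 0.8, (3, 5): 0.99, (10, 50): 0.1}
native = {(1, 30), (10, 50)}
precision = top_k_precision(scores, native, k=2)
```

In this toy case the pair (3, 5) is excluded despite its high score because it is not long-range, so precision is computed over (1, 30) and (2, 40) only. In practice `k` is usually tied to sequence length (for example L/5 or L/2), which the sketch leaves to the caller.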
https://github.com/multicom-toolbox/DNCON2/.
Supplementary data are available at Bioinformatics online.