School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China.
Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA.
Proteins. 2019 Dec;87(12):1082-1091. doi: 10.1002/prot.25798. Epub 2019 Aug 22.
We report the CASP13 residue-residue contact prediction results of a new pipeline built purely on learning from coevolutionary features. For a query sequence, the pipeline starts by collecting multiple sequence alignments (MSAs) from multiple genome and metagenome sequence databases using two complementary Hidden Markov Model (HMM)-based search tools. Three profile matrices, built on covariance, precision, and pseudolikelihood maximization respectively, are then created from the MSAs and used as the input features of a deep residual convolutional neural network for contact-map training and prediction. Two ensembling strategies are proposed to integrate the matrix features, one through end-to-end training and one through stacking, resulting in two complementary programs called TripletRes and ResTriplet, respectively. For the 31 free-modeling domains that have no homologous templates in the PDB, TripletRes and ResTriplet generated comparable results, with average accuracies of 0.640 and 0.646, respectively, for the top L/5 long-range predictions; 71% and 74% of the cases, respectively, have an accuracy above 0.5. Detailed data analyses showed that the strength of the pipeline comes from the sensitive MSA construction and the advanced strategies for ensembling coevolutionary features. Domain splitting was also found to improve contact prediction performance. Nevertheless, contact models remain suboptimal for tail regions, which often involve a high number of alignment gaps, and for targets with few homologous sequences. Developing new approaches in which the model is trained specifically on these regions and targets might help address these problems.
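To illustrate the evaluation metric quoted above, the following minimal Python sketch computes the accuracy (precision) of the top L/5 long-range predictions for a single target. It rests on assumptions the abstract does not spell out: long-range is taken to mean a sequence separation of at least 24 residues, as is common in CASP contact assessment, and the native contact map is supplied as a boolean matrix (typically derived from a Cβ-Cβ distance cutoff of 8 Å). The function name and its parameters are hypothetical and are not part of TripletRes or ResTriplet.

```python
import numpy as np


def top_l5_long_range_precision(pred, native, min_sep=24):
    """Fraction of the top L/5 long-range predicted pairs that are true contacts.

    pred    : (L, L) array of predicted contact scores (higher = more confident).
    native  : (L, L) boolean array, True where two residues are in contact.
    min_sep : assumed minimum |i - j| for a pair to count as long-range.
    """
    pred = np.asarray(pred, dtype=float)
    native = np.asarray(native, dtype=bool)
    if pred.shape != native.shape or pred.ndim != 2 or pred.shape[0] != pred.shape[1]:
        raise ValueError("pred and native must be square matrices of equal shape")

    L = pred.shape[0]
    # Upper-triangle pairs (i, j) with j - i >= min_sep, each pair counted once.
    i, j = np.triu_indices(L, k=min_sep)
    if i.size == 0:
        raise ValueError("sequence too short to contain any long-range pair")

    # Keep the L/5 highest-scoring long-range pairs (at least one pair).
    k = min(max(1, L // 5), i.size)
    top = np.argsort(-pred[i, j], kind="stable")[:k]

    # Precision = true contacts among the selected pairs / number selected.
    return float(native[i[top], j[top]].mean())
```

Averaging this value over all targets would give a summary of the same kind as the 0.640 and 0.646 figures reported above, and counting targets whose value exceeds 0.5 would give the corresponding percentage of cases.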