Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction.

Affiliation

Quantitative and Computational Biology Group, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany.

Publication information

PLoS Comput Biol. 2018 Nov 5;14(11):e1006526. doi: 10.1371/journal.pcbi.1006526. eCollection 2018 Nov.

DOI:10.1371/journal.pcbi.1006526

PMID:30395601

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6237422/

Abstract

Compensatory mutations between protein residues in physical contact can manifest themselves as statistical couplings between the corresponding columns in a multiple sequence alignment (MSA) of the protein family. Conversely, large coupling coefficients predict residue contacts. Methods for de-novo protein structure prediction based on this approach are becoming increasingly reliable. Their main limitation is the strong systematic and statistical noise in the estimation of coupling coefficients, which has so far limited their application to very large protein families. While most research has focused on improving predictions by adding external information, little progress has been made to improve the statistical procedure at the core, because our lack of understanding of the sources of noise poses a major obstacle. First, we show theoretically that the expectation value of the coupling score assuming no coupling is proportional to the product of the square roots of the column entropies, and we propose a simple entropy bias correction (EntC) that subtracts out this expectation value. Second, we show that the average product correction (APC) includes the correction of the entropy bias, partly explaining its success. Third, we have developed CCMgen, the first method for simulating protein evolution and generating realistic synthetic MSAs with pairwise statistical residue couplings. Fourth, to learn exact statistical models that reliably reproduce observed alignment statistics, we developed CCMpredPy, an implementation of the persistent contrastive divergence (PCD) method for exact inference. Fifth, we demonstrate how CCMgen and CCMpredPy can facilitate the development of contact prediction methods by analysing the systematic noise contributions from phylogeny and entropy. Using the entropy bias correction, we can disentangle both sources of noise and find that entropy contributes roughly twice as much noise as phylogeny.
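The two corrections named in the abstract can be illustrated with a minimal sketch. It is not the CCMpredPy implementation. It assumes a symmetric matrix of raw coupling scores is already available, uses plain Shannon entropy per column (the paper's exact entropy definition may differ), and fits the proportionality constant `alpha` by least squares, which is an assumption made here for illustration. The function names `column_entropies`, `apc`, and `entropy_correction` are hypothetical.

```python
import math
from collections import Counter


def column_entropies(msa):
    """Shannon entropy (in nats) of every column of an alignment.

    `msa` is a list of equal-length sequences (strings).
    """
    n_seqs = len(msa)
    entropies = []
    for column in zip(*msa):
        counts = Counter(column)
        entropies.append(
            -sum((c / n_seqs) * math.log(c / n_seqs) for c in counts.values())
        )
    return entropies


def apc(scores):
    """Average product correction on a symmetric score matrix.

    Subtracts mean_i * mean_j / overall_mean, with all means taken
    over off-diagonal entries only.
    """
    n = len(scores)
    row_means = [
        sum(scores[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)
    ]
    overall = sum(row_means) / n
    if overall == 0:
        return [row[:] for row in scores]  # nothing to correct
    return [
        [
            0.0 if i == j else scores[i][j] - row_means[i] * row_means[j] / overall
            for j in range(n)
        ]
        for i in range(n)
    ]


def entropy_correction(scores, entropies):
    """Entropy bias correction: subtract alpha * sqrt(H_i) * sqrt(H_j).

    ASSUMPTION: alpha is chosen by least squares over all pairs i < j.
    """
    n = len(scores)
    root = [math.sqrt(h) for h in entropies]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    numerator = sum(scores[i][j] * root[i] * root[j] for i, j in pairs)
    denominator = sum((root[i] * root[j]) ** 2 for i, j in pairs)
    alpha = numerator / denominator if denominator > 0 else 0.0
    return [
        [
            0.0 if i == j else scores[i][j] - alpha * root[i] * root[j]
            for j in range(n)
        ]
        for i in range(n)
    ]
```

In this sketch, a score matrix that is exactly proportional to the product of the square roots of the column entropies is reduced to zero by `entropy_correction`, which matches the abstract's description of subtracting the expected score under no coupling.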

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f13c/6237422/6550a71009cb/pcbi.1006526.g001.jpg
