Phylogenetic mixture models for proteins.

Authors

Le Si Quang, Lartillot Nicolas, Gascuel Olivier

Affiliation

Méthodes et Algorithmes pour Bioinformatique, LIRMM, CNRS - Université Montpellier II, 161 rue Ada, 34392 Montpellier Cedex 5, France.

Publication information

Philos Trans R Soc Lond B Biol Sci. 2008 Dec 27;363(1512):3965-76. doi: 10.1098/rstb.2008.0180.

DOI:10.1098/rstb.2008.0180

PMID:18852096

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2607422/

Abstract

Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code; solvent exposure; secondary and tertiary structure; protein function; etc. These factors affect the substitution pattern and, in most cases, a single replacement matrix is not enough to represent all the complexity of the evolutionary processes. This paper explores, in a maximum-likelihood framework, phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution. We learn these mixture models from a large alignment database extracted from HSSP, and test their performance using independent alignments from TREEBASE. We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where the estimation uses the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TREEBASE test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307-1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for only 1 alignment. Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures impacts not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.
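As a rough illustration of the quantities named in the abstract, the sketch below combines per-site likelihoods from several replacement matrices into a mixture log-likelihood and then computes an AIC gain per site. It is not the authors' PhyML implementation. The function names, the input layout, and the numbers in the usage lines are hypothetical, and it assumes the standard definition AIC = 2k - 2 lnL, with the per-site gain taken as the AIC difference divided by the alignment length.

```python
import math


def mixture_log_likelihood(weights, site_likelihoods):
    """Log-likelihood of an alignment under a mixture of replacement matrices.

    weights[k] is the mixture weight of matrix k (weights must sum to 1).
    site_likelihoods[i][k] is the likelihood of site i under matrix k,
    assumed to be already averaged over the gamma rate categories.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    total = 0.0
    for per_matrix in site_likelihoods:
        # A site's likelihood is the weighted sum over the mixture components.
        total += math.log(sum(w * lik for w, lik in zip(weights, per_matrix)))
    return total


def aic(log_likelihood, n_free_parameters):
    """Standard Akaike information criterion (lower is better)."""
    return 2.0 * n_free_parameters - 2.0 * log_likelihood


def aic_gain_per_site(lnl_single, k_single, lnl_mixture, k_mixture, n_sites):
    """Positive values mean the mixture model is preferred over the single matrix."""
    return (aic(lnl_single, k_single) - aic(lnl_mixture, k_mixture)) / n_sites


# Hypothetical numbers, for illustration only (not taken from the paper).
toy_lnl = mixture_log_likelihood([0.5, 0.5], [[0.2, 0.4], [0.1, 0.1]])
toy_gain = aic_gain_per_site(-1000.0, 10, -990.0, 12, 100)
```

Under these assumptions, a mixture is only preferred when its likelihood improvement outweighs the penalty for its extra free parameters, which is why the gain is reported as an AIC difference rather than a raw log-likelihood difference.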




References

Empirical profile mixture models for phylogenetic reconstruction.

Bioinformatics. 2008 Oct 15;24(20):2317-23. doi: 10.1093/bioinformatics/btn445. Epub 2008 Aug 21.

An improved general amino acid replacement matrix.

Mol Biol Evol. 2008 Jul;25(7):1307-20. doi: 10.1093/molbev/msn067. Epub 2008 Mar 26.

Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model.

BMC Evol Biol. 2007 Feb 8;7 Suppl 1(Suppl 1):S4. doi: 10.1186/1471-2148-7-S1-S4.

XRate: a fast prototyping, training and annotation tool for phylo-grammars.

BMC Bioinformatics. 2006 Oct 3;7:428. doi: 10.1186/1471-2105-7-428.

Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified.

BMC Evol Biol. 2006 Mar 24;6:29. doi: 10.1186/1471-2148-6-29.

Improving the efficiency of SPR moves in phylogenetic tree search methods based on maximum likelihood.

Bioinformatics. 2005 Dec 15;21(24):4338-47. doi: 10.1093/bioinformatics/bti713. Epub 2005 Oct 18.

An alternative model of amino acid replacement.

Bioinformatics. 2005 Apr 1;21(7):975-80. doi: 10.1093/bioinformatics/bti109. Epub 2004 Nov 5.

A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process.

Mol Biol Evol. 2004 Jun;21(6):1095-109. doi: 10.1093/molbev/msh112. Epub 2004 Mar 10.

A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.

Syst Biol. 2003 Oct;52(5):696-704. doi: 10.1080/10635150390235520.

An expectation maximization algorithm for training hidden substitution models.

J Mol Biol. 2002 Apr 12;317(5):753-64. doi: 10.1006/jmbi.2002.5405.
