Does the choice of nucleotide substitution models matter topologically?

Authors

Hoff Michael, Orf Stefan, Riehm Benedikt, Darriba Diego, Stamatakis Alexandros

Affiliations

Karlsruhe Institute of Technology, Department of Informatics, Kaiserstraße 12, Karlsruhe, 76131, Germany.

The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, Heidelberg, 69118, Germany.

Publication information

BMC Bioinformatics. 2016 Mar 24;17:143. doi: 10.1186/s12859-016-0985-x.

DOI:10.1186/s12859-016-0985-x

PMID:27009141

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4806516/

Abstract

BACKGROUND

As part of a master's-level programming practical at the computer science department of the Karlsruhe Institute of Technology, we developed and made available open-source code for testing all 203 possible nucleotide substitution models in the Maximum Likelihood (ML) setting under the common Akaike (AIC), corrected Akaike (AICc), and Bayesian (BIC) information criteria. We address the question of whether model selection matters topologically, that is, whether conducting ML inferences under the optimal model, instead of a standard General Time Reversible (GTR) model, yields different tree topologies. We also assess the degree to which the models selected and the trees inferred under the three standard criteria (AIC, AICc, BIC) differ. Finally, we assess whether the definition of the sample size (#sites versus #sites × #taxa) yields different models and, as a consequence, different tree topologies.
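The figure of 203 matches the number of ways the six substitution rates of a time-reversible model (AC, AG, AT, CG, CT, GT) can be grouped into classes that share one rate, which is the Bell number B(6). The following is a minimal sketch of that enumeration, not the authors' code; the six-digit string encoding and the function name `restricted_growth_strings` are assumptions made for illustration.

```python
from typing import Iterator, List

# Order of the six exchangeability rates assumed for the digit encoding.
RATES = ("AC", "AG", "AT", "CG", "CT", "GT")


def restricted_growth_strings(n: int) -> Iterator[str]:
    """Yield every set partition of n items as a restricted growth string.

    Digit i names the rate class of item i. The first item is always in
    class 0, and each later item may join an existing class or open the
    next unused one, so every partition is produced exactly once.
    """
    def extend(prefix: List[int], max_label: int) -> Iterator[str]:
        if len(prefix) == n:
            yield "".join(str(label) for label in prefix)
            return
        for label in range(max_label + 2):
            yield from extend(prefix + [label], max(max_label, label))

    yield from extend([0], 0)


models = list(restricted_growth_strings(len(RATES)))
print(len(models))          # number of rate-sharing schemes
print(models[0], models[-1])  # one shared rate ... six distinct rates
```

In this encoding `000000` is a single shared rate (as in JC or F81), `010010` separates transitions from transversions (as in K80 or HKY), and `012345` is the full GTR rate matrix. The number of free rate parameters of a scheme is the number of distinct digits minus one, because rates are only defined relative to each other.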

RESULTS

We find that all three factors (in order of impact: nucleotide model selection, information criterion used, sample size definition) can yield substantially different final tree topologies (topological difference exceeding 10 %) for approximately 5 % of the tree inferences conducted on the 39 empirical datasets used in our study.
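The abstract does not state which measure underlies the "topological difference". A common choice for comparing two unrooted trees on the same taxa is the relative Robinson-Foulds (RF) distance, and the sketch below assumes that measure. Trees are given as nested tuples of taxon names, a representation chosen here only to keep the example self-contained; all function names are illustrative.

```python
from typing import FrozenSet, Set, Tuple, Union

# A tree is either a taxon name or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """Return the set of taxon names in the tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the non-trivial bipartitions induced by the inner branches.

    Each bipartition is stored as the side that does NOT contain a fixed
    anchor taxon, so the same split always has the same representation.
    """
    all_taxa = leaves(tree)
    anchor = min(all_taxa)
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - below if anchor in below else below
        # Skip the root and branches leading to a single taxon.
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def relative_rf(a: Tree, b: Tree) -> float:
    """Relative RF distance: differing splits / maximum possible number."""
    taxa = leaves(a)
    if taxa != leaves(b):
        raise ValueError("trees must contain the same taxa")
    if len(taxa) < 4:
        return 0.0  # fewer than four taxa admit only one topology
    differing = len(splits(a) ^ splits(b))
    return differing / (2 * (len(taxa) - 3))


t1 = (("A", "B"), ("C", "D"), "E")
t2 = (("A", "B"), ("C", "E"), "D")
print(relative_rf(t1, t2))
```

Under this measure, a threshold of 10 % would mean that more than one in ten of the inner branches that could differ between two fully resolved trees do differ.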

CONCLUSIONS

We find that using the best-fit nucleotide substitution model may change the final ML tree topology compared to an inference under a default GTR model. The effect is less pronounced when comparing distinct information criteria. Nonetheless, in some cases we did obtain substantial topological differences.
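The three criteria compared in the study have standard textbook definitions, and only AICc and BIC depend on the sample size. The sketch below shows how the two sample size definitions mentioned in the abstract (#sites versus #sites × #taxa) can lead to different selected models. The log-likelihoods, the alignment dimensions, and the choice to count branch lengths as free parameters are made-up assumptions for illustration, not values or conventions taken from the paper.

```python
import math
from typing import Dict, Tuple


def aic(log_lik: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def aicc(log_lik: float, k: int, n: int) -> float:
    """AIC with small-sample correction; requires n > k + 1."""
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_lik: float, k: int, n: int) -> float:
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


def best_by_bic(candidates: Dict[str, Tuple[float, int]], n: int) -> str:
    """Return the name of the candidate with the lowest BIC."""
    return min(candidates, key=lambda name: bic(*candidates[name], n))


# Hypothetical alignment: 10 taxa, 100 sites.
taxa, sites = 10, 100
branches = 2 * taxa - 3  # branches of an unrooted binary tree

# name -> (log-likelihood, free parameters); numbers are invented.
candidates = {
    "JC": (-2523.0, branches + 0),   # no free model parameters
    "GTR": (-2500.0, branches + 8),  # 5 rates + 3 base frequencies
}

print(best_by_bic(candidates, n=sites))         # sample size = #sites
print(best_by_bic(candidates, n=sites * taxa))  # sample size = #sites x #taxa
```

With these invented numbers the larger sample size definition raises the BIC penalty per parameter enough to favor the simpler model, which illustrates why the definition of the sample size can change the selected model.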


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7121/4806516/0a33c854ffc4/12859_2016_985_Fig1_HTML.jpg



