The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference.

Affiliation

Section of Integrative Biology, University of Texas at Austin, 1 University Station C0930, Austin, TX 78712, USA.

Publication information

Syst Biol. 2009 Feb;58(1):130-45. doi: 10.1093/sysbio/syp017. Epub 2009 May 22.

DOI:10.1093/sysbio/syp017

PMID:20525573

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC7539334/

Abstract

Although an increasing number of phylogenetic data sets are incomplete, the effect of ambiguous data on phylogenetic accuracy is not well understood. We use 4-taxon simulations to study the effects of ambiguous data (i.e., missing characters or gaps) in maximum likelihood (ML) and Bayesian frameworks. By introducing ambiguous data in a way that removes confounding factors, we provide the first clear understanding of one mechanism by which ambiguous data can mislead phylogenetic analyses. We find that in both ML and Bayesian frameworks, among-site rate variation can interact with ambiguous data to produce misleading estimates of topology and branch lengths. Furthermore, within a Bayesian framework, priors on branch lengths and rate heterogeneity parameters can exacerbate the effects of ambiguous data, resulting in strongly misleading bipartition posterior probabilities. The magnitude and direction of the ambiguous data bias are a function of the number and taxonomic distribution of ambiguous characters, the strength of topological support, and whether or not the model is correctly specified. The results of this study have major implications for all analyses that rely on accurate estimates of topology or branch lengths, including divergence time estimation, ancestral state reconstruction, tree-dependent comparative methods, rate variation analysis, phylogenetic hypothesis testing, and phylogeographic analysis.
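To make concrete what "ambiguous data" means inside a likelihood calculation, the following is a minimal sketch of the standard treatment: a missing character or gap at a tip is compatible with every state, so its conditional likelihood vector is all ones and the site likelihood sums over the possible bases. The sketch assumes a JC69 substitution model with no among-site rate variation and a fixed unrooted 4-taxon tree ((A,B),(C,D)); the function names and branch lengths are hypothetical and are not taken from the paper, whose simulation design is not described in the abstract.

```python
import math

BASES = "ACGT"

def jc69(t):
    """4x4 JC69 transition matrix for branch length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def tip_partial(char):
    """One-hot vector for an observed base; all ones for an ambiguous character."""
    if char in BASES:
        return [1.0 if b == char else 0.0 for b in BASES]
    return [1.0] * 4  # '?', '-', 'N': every state is compatible with the observation

def branch_up(partial, t):
    """Likelihood of the data below a branch, for each state at the branch's upper end."""
    p = jc69(t)
    return [sum(p[i][j] * partial[j] for j in range(4)) for i in range(4)]

def site_likelihood(site, tip_lengths, internal_length):
    """Likelihood of one alignment column on the unrooted tree ((A,B),(C,D)).

    site is a 4-character string in taxon order A, B, C, D.
    """
    a, b, c, d = (branch_up(tip_partial(ch), t) for ch, t in zip(site, tip_lengths))
    left = [a[i] * b[i] for i in range(4)]                      # node joining A and B
    right = branch_up([c[i] * d[i] for i in range(4)], internal_length)
    return sum(0.25 * left[i] * right[i] for i in range(4))    # uniform JC69 base frequencies

# Hypothetical branch lengths, chosen only for illustration.
TIPS = (0.1, 0.1, 0.1, 0.1)
INTERNAL = 0.05
observed = site_likelihood("AACC", TIPS, INTERNAL)
with_gap = site_likelihood("AAC?", TIPS, INTERNAL)
```

Under this treatment, `site_likelihood("AAC?", ...)` equals the sum of the likelihoods of `AACA`, `AACC`, `AACG`, and `AACT`, so an ambiguous character removes information about that taxon rather than adding evidence. The sketch deliberately omits rate heterogeneity and branch-length priors, which the abstract identifies as the factors that interact with ambiguous data to bias estimates.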
