Impact of missing data on phylogenies inferred from empirical phylogenomic data sets.

Affiliation

Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, Montréal, Québec, Canada.

Publication information

Mol Biol Evol. 2013 Jan;30(1):197-214. doi: 10.1093/molbev/mss208. Epub 2012 Aug 28.

Abstract

Progress in sequencing technology allows researchers to assemble ever-larger supermatrices for phylogenomic inference. However, current phylogenomic studies often rest on patchy data sets, with some having 80% missing (or ambiguous) data or more. Though early simulations had suggested that missing data per se do not harm phylogenetic inference when using sufficiently large data sets, Lemmon et al. (Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst Biol. 58:130-145.) have recently cast doubt on this consensus in a study based on the introduction of parsimony-uninformative incomplete characters. In this work, we empirically reassess the issue of missing data in phylogenomics while exploring possible interactions with the model of sequence evolution. First, we note that parsimony-uninformative incomplete characters are actually informative in a probabilistic framework. A reanalysis of the data set of Lemmon et al. with this in mind gives a very different interpretation of their results and shows that some of their conclusions may be unfounded. Second, we investigate the effect of the progressive introduction of missing data in a complete supermatrix (126 genes × 39 species) capable of resolving animal relationships. These analyses demonstrate that missing data perturb phylogenetic inference slightly beyond the expected decrease in resolving power. In particular, they exacerbate systematic errors by reducing the number of species effectively available for the detection of multiple substitutions. Consequently, large sparse supermatrices are more sensitive to phylogenetic artifacts than smaller but less incomplete data sets, which argues for experimental designs aimed at collecting a modest number (~50) of highly covered genes. Our results further confirm that including incomplete yet short-branch taxa (i.e., slowly evolving species or close outgroups) can help to eschew artifacts, as predicted by simulations. Finally, it appears that selecting an adequate model of sequence evolution (e.g., the site-heterogeneous CAT model instead of the site-homogeneous WAG model) is more beneficial to phylogenetic accuracy than reducing the level of missing data.
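The "progressive introduction of missing data in a complete supermatrix" can be pictured with a small sketch. The abstract does not describe the actual masking procedure, so the code below rests on assumptions: whole gene-by-species blocks are replaced by `?` at random, and the names `missing_fraction`, `mask_gene_blocks`, `toy`, and `bounds` are hypothetical helpers for illustration, not part of the authors' pipeline.

```python
import random

MISSING = "?"  # assumed symbol for a missing character state


def missing_fraction(alignment):
    """Fraction of alignment cells that are missing ('?') or gaps ('-')."""
    total = sum(len(seq) for seq in alignment.values())
    missing = sum(seq.count("?") + seq.count("-") for seq in alignment.values())
    return missing / total


def mask_gene_blocks(alignment, gene_bounds, fraction, seed=0):
    """Return a copy in which `fraction` of (species, gene) blocks are all '?'.

    alignment   : dict mapping species name -> concatenated sequence
    gene_bounds : list of (start, end) column ranges, one per gene
    fraction    : share of gene-by-species blocks to mask (0.0 to 1.0)
    """
    rng = random.Random(seed)
    blocks = [(sp, g) for sp in alignment for g in range(len(gene_bounds))]
    chosen = rng.sample(blocks, round(fraction * len(blocks)))
    masked = {sp: list(seq) for sp, seq in alignment.items()}
    for sp, g in chosen:
        start, end = gene_bounds[g]
        masked[sp][start:end] = MISSING * (end - start)
    return {sp: "".join(chars) for sp, chars in masked.items()}


# Toy complete supermatrix: 4 species x 3 genes of 4 columns each.
toy = {
    "sp1": "ACDEFGHIKLMN",
    "sp2": "ACDEFGHIKLMN",
    "sp3": "ACDQFGHIKLMN",
    "sp4": "ACDQFGHVKLMN",
}
bounds = [(0, 4), (4, 8), (8, 12)]

# Increase the masked share step by step and report the resulting sparsity.
for share in (0.0, 0.25, 0.5, 0.75):
    sparse = mask_gene_blocks(toy, bounds, share)
    print(share, missing_fraction(sparse))
```

Each masked matrix would then be passed to the same tree-inference method as the complete one, so that any change in the inferred tree can be attributed to the added missing data rather than to a change of method.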

Similar articles

The use and validity of composite taxa in phylogenetic analysis.

Syst Biol. 2009 Dec;58(6):560-72. doi: 10.1093/sysbio/syp056. Epub 2009 Sep 21.

An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics.

Syst Biol. 2005 Oct;54(5):743-57. doi: 10.1080/10635150500234609.

Improvement of molecular phylogenetic inference and the phylogeny of Bilateria.

Philos Trans R Soc Lond B Biol Sci. 2008 Apr 27;363(1496):1463-72. doi: 10.1098/rstb.2007.2236.

Phylogeny of Bembidion and related ground beetles (Coleoptera: Carabidae: Trechinae: Bembidiini: Bembidiina).

Mol Phylogenet Evol. 2012 Jun;63(3):533-76. doi: 10.1016/j.ympev.2012.01.015. Epub 2012 Mar 13.

Different phylogenomic approaches to resolve the evolutionary relationships among model fish species.

Mol Biol Evol. 2010 Dec;27(12):2757-74. doi: 10.1093/molbev/msq165. Epub 2010 Jun 29.

SDM: a fast distance-based approach for (super) tree building in phylogenomics.

Syst Biol. 2006 Oct;55(5):740-55. doi: 10.1080/10635150600969872.

Molecular phylogeny of the carnivora (mammalia): assessing the impact of increased sampling on resolving enigmatic relationships.

Syst Biol. 2005 Apr;54(2):317-37. doi: 10.1080/10635150590923326.

Radical instability and spurious branch support by likelihood when applied to matrices with non-random distributions of missing data.

Mol Phylogenet Evol. 2012 Jan;62(1):472-84. doi: 10.1016/j.ympev.2011.10.017. Epub 2011 Oct 31.

Sparse supermatrices for phylogenetic inference: taxonomy, alignment, rogue taxa, and the phylogeny of living turtles.

Syst Biol. 2010 Jan;59(1):42-58. doi: 10.1093/sysbio/syp075. Epub 2009 Nov 11.

Cited by

Opportunities and Challenges in Applying AI to Evolutionary Morphology.

Integr Org Biol. 2024 Sep 23;6(1):obae036. doi: 10.1093/iob/obae036. eCollection 2024.

Phylogenetic Signal in Primate Tooth Enamel Proteins and its Relevance for Paleoproteomics.

Genome Biol Evol. 2025 Feb 3;17(2). doi: 10.1093/gbe/evaf007.

BAD2matrix: Phylogenomic matrix concatenation, indel coding, and more.

Appl Plant Sci. 2024 Sep 24;12(6):e11604. doi: 10.1002/aps3.11604. eCollection 2024 Nov-Dec.

Data-driven guidelines for phylogenomic analyses using SNP data.

Appl Plant Sci. 2024 Aug 9;12(6):e11611. doi: 10.1002/aps3.11611. eCollection 2024 Nov-Dec.

Orthoptera-specific target enrichment (OR-TE) probes resolve relationships over broad phylogenetic scales.

Sci Rep. 2024 Sep 13;14(1):21377. doi: 10.1038/s41598-024-72622-6.

Genomes of from Panama and from Brazil: Expansion of Multigene Families in Leishmaniinae Parasites That Are Close Relatives of spp.

Pathogens. 2023 Nov 30;12(12):1409. doi: 10.3390/pathogens12121409.

A Reinvestigation of Multiple Independent Evolution and Triassic-Jurassic Origins of Multicellular Volvocine Algae.

Genome Biol Evol. 2023 Aug 1;15(8). doi: 10.1093/gbe/evad142.

Redefining Possible: Combining Phylogenomic and Supersparse Data in Frogs.

Mol Biol Evol. 2023 May 2;40(5). doi: 10.1093/molbev/msad109.

Comparison of chloroplast genomes and phylogenomics in the Ficus sarmentosa complex (Moraceae).

PLoS One. 2022 Dec 30;17(12):e0279849. doi: 10.1371/journal.pone.0279849. eCollection 2022.

Complete mitochondrial genomes and updated divergence time of the two freshwater clupeids endemic to Lake Tanganyika (Africa) suggest intralacustrine speciation.

BMC Ecol Evol. 2022 Nov 3;22(1):127. doi: 10.1186/s12862-022-02085-8.
