Prospects for building large timetrees using molecular data with incomplete gene coverage among species.

Authors

Filipski Alan, Murillo Oscar, Freydenzon Anna, Tamura Koichiro, Kumar Sudhir

Affiliations

Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University.

Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University; School of Life Sciences, Arizona State University.

Publication information

Mol Biol Evol. 2014 Sep;31(9):2542-50. doi: 10.1093/molbev/msu200. Epub 2014 Jun 27.

DOI: 10.1093/molbev/msu200

PMID: 24974376

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4137717/

Abstract

Scientists are assembling sequence data sets from increasing numbers of species and genes to build comprehensive timetrees. However, data are often unavailable for some species and gene combinations, and the proportion of missing data is often large for data sets containing many genes and species. Surprisingly, there has not been a systematic analysis of the effect of the degree of sparseness of the species-gene matrix on the accuracy of divergence time estimates. Here, we present results from computer simulations and empirical data analyses to quantify the impact of missing gene data on divergence time estimation in large phylogenies. We found that estimates of divergence times were robust even when sequences from a majority of genes for most of the species were absent. From the analysis of such extremely sparse data sets, we found that the most egregious errors occurred for nodes in the tree that had no common genes for any pair of species in the immediate descendant clades of the node in question. These problematic nodes can be easily detected prior to computational analyses based only on the input sequence alignment and the tree topology. We conclude that it is best to use larger alignments, because adding both genes and species to the alignment augments the number of genes available for estimating divergence events deep in the tree and improves their time estimates.
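
The pre-analysis check described in the abstract, which flags nodes whose immediate descendant clades share no gene, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation. It assumes a tree written as nested tuples of species names and a hypothetical `genes_by_species` mapping from each species to the set of genes sequenced for it. It also assumes one reading of the criterion: a node is flagged when no species in one child clade shares a gene with any species in another child clade. The `sparseness` helper is likewise an assumed definition, namely the fraction of empty cells in the species-gene matrix.

```python
from itertools import combinations


def clade_species(node):
    """Return the set of species under a node (leaf = str, internal node = tuple of children)."""
    if isinstance(node, str):
        return {node}
    species = set()
    for child in node:
        species |= clade_species(child)
    return species


def clade_genes(node, genes_by_species):
    """Return every gene sequenced for at least one species in the clade."""
    genes = set()
    for sp in clade_species(node):
        genes |= genes_by_species.get(sp, set())
    return genes


def flag_unlinked_nodes(tree, genes_by_species):
    """List internal nodes whose child clades have no gene in common.

    Two clades share a gene for some species pair exactly when the unions
    of their gene sets intersect, so only the unions are compared.
    """
    flagged = []

    def visit(node):
        if isinstance(node, str):
            return
        gene_sets = [clade_genes(child, genes_by_species) for child in node]
        if not any(a & b for a, b in combinations(gene_sets, 2)):
            # Describe the node by the sorted species of each child clade.
            flagged.append(tuple(tuple(sorted(clade_species(c))) for c in node))
        for child in node:
            visit(child)

    visit(tree)
    return flagged


def sparseness(genes_by_species):
    """Fraction of missing cells in the species-gene matrix (assumes a non-empty matrix)."""
    all_genes = set().union(*genes_by_species.values())
    total = len(genes_by_species) * len(all_genes)
    present = sum(len(genes) for genes in genes_by_species.values())
    return (total - present) / total


# Hypothetical toy data: five species, three genes.
tree = ((("A", "B"), "C"), ("D", "E"))
coverage = {
    "A": {"g1"},
    "B": {"g1", "g2"},
    "C": {"g2"},
    "D": {"g3"},
    "E": {"g3"},
}

print(flag_unlinked_nodes(tree, coverage))
print(sparseness(coverage))
```

In this toy example only the root is flagged, because the clade (A, B, C) is covered by g1 and g2 while (D, E) is covered only by g3. The check needs nothing beyond the gene coverage and the topology, which matches the abstract's point that such nodes can be found before any dating analysis is run.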
