Online Phylogenetics with matOptimize Produces Equivalent Trees and is Dramatically More Efficient for Large SARS-CoV-2 Phylogenies than de novo and Maximum-Likelihood Implementations.

Affiliations

Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.

Publication information

Syst Biol. 2023 Nov 1;72(5):1039-1051. doi: 10.1093/sysbio/syad031.

DOI:10.1093/sysbio/syad031

PMID:37232476

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10627557/

Abstract

Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 data sets do not fit this mold. There are currently over 14 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) and pseudo-ML methods may be more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger data sets. Here, we evaluate the performance of de novo and online phylogenetic approaches, as well as ML, pseudo-ML, and MP frameworks for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimization with UShER and matOptimize produces equivalent SARS-CoV-2 phylogenies to some of the most popular ML and pseudo-ML inference tools. 
MP optimization with UShER and matOptimize is thousands of times faster than presently available implementations of ML, and online phylogenetics is faster than de novo inference. Our results therefore suggest that parsimony-based methods like UShER and matOptimize represent an accurate and more practical alternative to established ML implementations for large SARS-CoV-2 phylogenies and could be successfully applied to other similar data sets with particularly dense sampling and short branch lengths.
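
To make the two ideas in the abstract concrete, the following is a minimal sketch of (a) the maximum-parsimony criterion, scored with the classic Fitch algorithm, and (b) an "online" step that adds one new sample to an existing tree at the attachment point with the lowest parsimony score. It is an illustration only and is not the implementation used by UShER or matOptimize, which rely on mutation-annotated trees and far more efficient search. The function names (`fitch_site`, `parsimony_score`, `placements`, `best_placement`), the nested-tuple tree encoding, and the toy sequences are all assumptions made for this example.

```python
from typing import Dict, Iterator, Set, Tuple, Union

# A rooted binary tree: a leaf is a sample name, an internal node is a pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Fitch pass for one alignment column: (candidate states, mutation count)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra mutation needed.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one mutation.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Tree, seqs: Dict[str, str]) -> int:
    """Sum the per-site Fitch mutation counts over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


def placements(tree: Tree, new_name: str) -> Iterator[Tree]:
    """Yield every tree obtained by attaching new_name to one branch."""
    yield (tree, new_name)  # attach on the branch above this node
    if not isinstance(tree, str):
        left, right = tree
        for alt in placements(left, new_name):
            yield (alt, right)
        for alt in placements(right, new_name):
            yield (left, alt)


def best_placement(
    tree: Tree, seqs: Dict[str, str], new_name: str, new_seq: str
) -> Tuple[int, Tree]:
    """Exhaustively score every attachment point and return the cheapest."""
    all_seqs = {**seqs, new_name: new_seq}
    return min(
        ((parsimony_score(t, all_seqs), t) for t in placements(tree, new_name)),
        key=lambda pair: pair[0],
    )


# Toy data (hypothetical): four aligned 4-base sequences and an existing tree.
seqs = {"A": "AAAA", "B": "AAAC", "C": "GGAA", "D": "GGTA"}
tree = (("A", "B"), ("C", "D"))
base_score = parsimony_score(tree, seqs)

# Online step: add sample E without re-inferring the tree from scratch.
new_score, new_tree = best_placement(tree, seqs, "E", "GGTT")
print(base_score, new_score, new_tree)
```

In this toy case the new sample shares a derived base with D and carries one private change, so the cheapest placement makes it sister to D and raises the score by exactly one. The exhaustive search here costs one full rescoring per branch; it is shown only because it is easy to verify by hand, not because it scales to millions of genomes.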

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e2fd/10627557/c04ab2b7bdbd/syad031_fig1.jpg

Similar articles

Online Phylogenetics using Parsimony Produces Slightly Better Trees and is Dramatically More Efficient for Large SARS-CoV-2 Phylogenies than de novo and Maximum-Likelihood Approaches.

bioRxiv. 2022 May 18:2021.12.02.471004. doi: 10.1101/2021.12.02.471004.

matOptimize: a parallel tree optimization method enables online phylogenetics for SARS-CoV-2.

Bioinformatics. 2022 Aug 2;38(15):3734-3740. doi: 10.1093/bioinformatics/btac401.

Taxonium, a web-based tool for exploring large phylogenetic trees.

Elife. 2022 Nov 15;11:e82392. doi: 10.7554/eLife.82392.

Pandemic-scale phylogenetics.

bioRxiv. 2021 Dec 6:2021.12.03.470766. doi: 10.1101/2021.12.03.470766.

Maximum likelihood pandemic-scale phylogenetics.

Nat Genet. 2023 May;55(5):746-752. doi: 10.1038/s41588-023-01368-0. Epub 2023 Apr 10.

TopHap: rapid inference of key phylogenetic structures from common haplotypes in large genome collections with limited diversity.

Bioinformatics. 2022 May 13;38(10):2719-2726. doi: 10.1093/bioinformatics/btac186.

Robust expansion of phylogeny for fast-growing genome sequence data.

PLoS Comput Biol. 2024 Feb 8;20(2):e1011871. doi: 10.1371/journal.pcbi.1011871. eCollection 2024 Feb.

Maximum parsimony, substitution model, and probability phylogenetic trees.

J Comput Biol. 2011 Jan;18(1):67-80. doi: 10.1089/cmb.2009.0232. Epub 2010 Jul 12.

A Daily-Updated Database and Tools for Comprehensive SARS-CoV-2 Mutation-Annotated Trees.

Mol Biol Evol. 2021 Dec 9;38(12):5819-5824. doi: 10.1093/molbev/msab264.

Cited by

UShER-TB: Scalable, Comprehensive, Accessible Phylogenomic Analysis of .

medRxiv. 2025 Jul 23:2025.07.22.25331806. doi: 10.1101/2025.07.22.25331806.

Algorithms to reconstruct past indels: The deletion-only parsimony problem.

PLoS Comput Biol. 2025 Jul 28;21(7):e1012585. doi: 10.1371/journal.pcbi.1012585. eCollection 2025 Jul.

Evolutionary and epidemic dynamics of COVID-19 in Germany exemplified by three Bayesian phylodynamic case studies.

Bioinform Biol Insights. 2025 Mar 12;19:11779322251321065. doi: 10.1177/11779322251321065. eCollection 2025.

Challenges in Assembling the Dated Tree of Life.

Genome Biol Evol. 2024 Oct 9;16(10). doi: 10.1093/gbe/evae229.

Phylogenetic Tree Instability After Taxon Addition: Empirical Frequency, Predictability, and Consequences For Online Inference.

Syst Biol. 2025 Feb 10;74(1):101-111. doi: 10.1093/sysbio/syae059.

Modeling Substitution Rate Evolution across Lineages and Relaxing the Molecular Clock.

Genome Biol Evol. 2024 Sep 3;16(9). doi: 10.1093/gbe/evae199.

Please Mind the Gap: Indel-Aware Parsimony for Fast and Accurate Ancestral Sequence Reconstruction and Multiple Sequence Alignment Including Long Indels.

Mol Biol Evol. 2024 Jul 3;41(7). doi: 10.1093/molbev/msae109.

Scalable neighbour search and alignment with uvaia.

PeerJ. 2024 Mar 6;12:e16890. doi: 10.7717/peerj.16890. eCollection 2024.

SARS-CoV-2 lineage assignments using phylogenetic placement/UShER are superior to pangoLEARN machine-learning method.

Virus Evol. 2024 Jan 11;10(1):vead085. doi: 10.1093/ve/vead085. eCollection 2024.

Representing and extending ensembles of parsimonious evolutionary histories with a directed acyclic graph.

J Math Biol. 2023 Oct 25;87(5):75. doi: 10.1007/s00285-023-02006-3.
