Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
Genomics Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA.
Syst Biol. 2023 Nov 1;72(5):1039-1051. doi: 10.1093/sysbio/syad031.
Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 data sets do not fit this mold. There are currently over 14 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) and pseudo-ML methods may be more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger data sets. Here, we evaluate the performance of de novo and online phylogenetic approaches, as well as ML, pseudo-ML, and MP frameworks for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimization with UShER and matOptimize produces equivalent SARS-CoV-2 phylogenies to some of the most popular ML and pseudo-ML inference tools. 
MP optimization with UShER and matOptimize is thousands of times faster than presently available implementations of ML, and online phylogenetics is faster than de novo inference. Our results therefore suggest that parsimony-based methods like UShER and matOptimize represent an accurate and more practical alternative to established ML implementations for large SARS-CoV-2 phylogenies and could be successfully applied to other similar data sets with particularly dense sampling and short branch lengths.
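The argument that multiple changes at a single site on a single branch should be extremely rare under dense sampling can be illustrated with a little arithmetic. The sketch below is not taken from the study. It assumes substitutions at a site follow a Poisson process along a branch, and it uses the length of the SARS-CoV-2 reference genome (29,903 nt) to convert "about one mutation per branch" into an expected number of substitutions per site. The function name `prob_multiple_hits` and the example branch lengths are hypothetical choices made for illustration only.

```python
import math

# Assumption: length of the SARS-CoV-2 reference genome in nucleotides.
GENOME_LENGTH = 29_903


def prob_multiple_hits(expected_subs_per_site: float) -> float:
    """Probability of two or more substitutions at one site on one branch.

    Assumes a simple Poisson model (an illustrative simplification, not the
    model used by any particular inference tool).
    """
    lam = expected_subs_per_site
    # P(k >= 2) = 1 - P(0) - P(1); expm1 keeps precision when lam is tiny.
    return -math.expm1(-lam) - lam * math.exp(-lam)


# A very short branch: about one mutation across the whole genome.
short_branch = 1 / GENOME_LENGTH
# A much longer branch, typical of sparsely sampled data (hypothetical value).
long_branch = 0.1

p_short = prob_multiple_hits(short_branch)
p_long = prob_multiple_hits(long_branch)

print(f"short branch: P(multiple hits at a site) = {p_short:.3e}")
print(f"long branch:  P(multiple hits at a site) = {p_long:.3e}")
```

Under these assumptions the probability scales roughly with the square of the branch length, so very short branches leave few of the cases in which a likelihood model would be expected to differ from parsimony.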