Thornlow Bryan, Kramer Alexander, Ye Cheng, De Maio Nicola, McBroome Jakob, Hinrichs Angie S, Lanfear Robert, Turakhia Yatish, Corbett-Detig Russell
Department of Biomolecular Engineering, University of California, Santa Cruz; Santa Cruz, CA 95064, USA.
Genomics Institute, University of California, Santa Cruz; Santa Cruz, CA 95064, USA.
bioRxiv. 2022 May 18:2021.12.02.471004. doi: 10.1101/2021.12.02.471004.
Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing the emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 datasets do not fit this mould. There are currently over 10 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) methods are more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that such instances will be extremely rare, because each internal branch is expected to be extremely short. Therefore, approaches based on maximum parsimony (MP) may be sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger datasets. Here, we evaluate the performance of de novo and online phylogenetic approaches, as well as ML and MP frameworks, for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces phylogenetic trees similar to those from de novo analyses for SARS-CoV-2, and that MP optimizations produce more accurate SARS-CoV-2 phylogenies than do ML optimizations. Since MP is thousands of times faster than presently available implementations of ML, and online phylogenetics is faster than de novo inference, we propose that, in the context of comprehensive genomic epidemiology of SARS-CoV-2, MP online phylogenetics approaches should be favored.