Institute for Comparative Genomics, American Museum of Natural History, New York, New York 10024, USA;
Section for Hologenomics, The Globe Institute, University of Copenhagen, DK-1353 Copenhagen, Denmark.
Genome Res. 2024 Oct 29;34(10):1651-1660. doi: 10.1101/gr.278594.123.
The COVID-19 pandemic has highlighted the critical role of genomic surveillance for guiding policy and control. Timeliness is key, but sequence alignment and phylogeny slow most surveillance techniques. Millions of SARS-CoV-2 genomes have been assembled. Phylogenetic methods are ill equipped to handle this sheer scale. We introduce a pangenomic measure that examines the information diversity of a k-mer library drawn from a country's complete set of clinical, pooled, or wastewater sequence. Quantifying diversity is central to ecology. Hill numbers, or the effective number of species in a sample, provide a simple metric for comparing species diversity across environments. The more diverse the sample, the higher the Hill number. We adopt this ecological approach and consider each k-mer an individual and each genome a transect in the pangenome of the species. Structured in this way, Hill numbers summarize the temporal trajectory of pandemic variants, collapsing each day's assemblies into genome equivalents. For pooled or wastewater sequence, we instead compare days using survey sequence divorced from individual infections. Across data from the UK, USA, and South Africa, we trace the ascendance of new variants of concern as they emerge in local populations well before these variants are named and added to phylogenetic databases. Using data from San Diego wastewater, we monitor these same population changes from raw, unassembled sequence. This history of emerging variants senses all available data as it is sequenced, intimating variant sweeps to dominance or declines to extinction at the leading edge of the COVID-19 pandemic.
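To make the Hill number idea concrete, the sketch below counts k-mers in a set of sequences and computes the standard ecological Hill number of order q from their relative abundances. It is an illustration only and not the authors' implementation: the function names `kmer_counts` and `hill_number`, the choice of k, and the rule of skipping k-mers containing non-ACGT characters are all assumptions, and the paper's treatment of genomes as transects and its scaling to genome equivalents are not reproduced here.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer across the sequences (hypothetical helper).

    Assumption: k-mers containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def hill_number(counts, q):
    """Hill number of order q: the effective number of equally common k-mers.

    For q != 1 this is (sum_i p_i^q)^(1/(1-q)); for q = 1 it is the
    exponential of Shannon entropy, the limit of that expression.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no k-mers counted")
    props = [c / total for c in counts.values() if c > 0]
    if math.isclose(q, 1.0):
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


if __name__ == "__main__":
    # Toy sequences, not real SARS-CoV-2 data.
    day = ["ACGTACGTAC", "ACGTTCGTAC"]
    counts = kmer_counts(day, k=4)
    for q in (0, 1, 2):
        print(q, hill_number(counts, q))
```

At q = 0 the value is simply the number of distinct k-mers, while higher orders weight abundant k-mers more heavily, so a library dominated by a few k-mers scores lower than an even one of the same richness.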