Zhuang Xiaowei, Vo Van, Moshi Michael A, Dhede Ketan, Ghani Nabih, Akbar Shahraiz, Chang Ching-Lan, Young Angelia K, Buttery Erin, Bendik William, Zhang Hong, Afzal Salman, Moser Duane, Cordes Dietmar, Lockett Cassius, Gerrity Daniel, Kan Horng-Yuan, Oh Edwin C
medRxiv. 2024 Apr 19:2024.04.18.24306052. doi: 10.1101/2024.04.18.24306052.
Genome sequencing from wastewater has emerged as an accurate and cost-effective tool for identifying SARS-CoV-2 variants. However, existing methods for analyzing wastewater sequencing data are not designed to detect novel variants that have not been characterized in humans. Here, we present an unsupervised learning approach that clusters co-varying and time-evolving mutation patterns leading to the identification of SARS-CoV-2 variants. To build our model, we sequenced 3,659 wastewater samples collected over a span of more than two years from urban and rural locations in Southern Nevada. We then developed a multivariate independent component analysis (ICA)-based pipeline to transform mutation frequencies into independent sources with co-varying and time-evolving patterns and compared variant predictions to >5,000 SARS-CoV-2 clinical genomes isolated from Nevadans. Using the source patterns as data-driven reference "barcodes", we demonstrated the model's accuracy by successfully detecting the Delta variant in late 2021, Omicron variants in 2022, and emerging recombinant XBB variants in 2023. Our approach revealed the spatial and temporal dynamics of variants in both urban and rural regions; achieved earlier detection of most variants compared to other computational tools; and uncovered unique co-varying mutation patterns not associated with any known variant. The multivariate nature of our pipeline boosts statistical power and can support accurate and early detection of SARS-CoV-2 variants. This feature offers a unique opportunity for novel variant and pathogen detection, even in the absence of clinical testing.
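The core idea of the abstract, decomposing a matrix of mutation frequencies into independent sources that each pair a co-varying mutation pattern with a time course, can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the function names `simulate_frequencies` and `fast_ica` are hypothetical, the number of sources is assumed to be known in advance, and a generic FastICA (tanh nonlinearity, symmetric decorrelation) written in NumPy stands in for whatever ICA variant the study actually used.

```python
import numpy as np


def simulate_frequencies(n_times=40, n_mutations=300, noise=0.01, seed=0):
    """Build a synthetic mutation-frequency matrix (time points x mutations)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_times)
    # Two hypothetical variants whose abundance peaks at different times.
    abundance = np.stack(
        [np.exp(-((t - 0.3) / 0.15) ** 2), np.exp(-((t - 0.7) / 0.15) ** 2)],
        axis=1,
    )
    # Sparse binary "barcodes": which mutations each synthetic variant carries.
    barcodes = (rng.random((2, n_mutations)) < 0.1).astype(float)
    freqs = abundance @ barcodes
    freqs += noise * rng.standard_normal((n_times, n_mutations))
    return np.clip(freqs, 0.0, 1.0), abundance, barcodes


def fast_ica(freqs, n_sources, max_iter=200, tol=1e-6, seed=0):
    """Return (sources, time_courses) with freqs ~ time_courses @ sources."""
    rng = np.random.default_rng(seed)
    n_mut = freqs.shape[1]
    # Center each time point across mutations.
    centered = freqs - freqs.mean(axis=1, keepdims=True)
    # PCA whitening: keep the top components and scale them to unit variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    z = np.sqrt(n_mut) * vt[:n_sources]
    # Random orthogonal starting point for the unmixing matrix.
    w, _ = np.linalg.qr(rng.standard_normal((n_sources, n_sources)))
    for _ in range(max_iter):
        g = np.tanh(w @ z)
        # FastICA fixed-point update for the log-cosh contrast.
        w_new = (g @ z.T) / n_mut - np.diag((1.0 - g ** 2).mean(axis=1)) @ w
        # Symmetric decorrelation keeps the rows of w orthonormal.
        u, _, v = np.linalg.svd(w_new)
        w_new = u @ v
        converged = np.max(np.abs(np.abs(np.sum(w_new * w, axis=1)) - 1.0)) < tol
        w = w_new
        if converged:
            break
    sources = w @ z
    # Resolve the sign ambiguity so that each pattern is positively skewed.
    signs = np.sign((sources ** 3).mean(axis=1))
    signs[signs == 0] = 1.0
    sources = sources * signs[:, None]
    # Least-squares time course of each source (sources have unit variance).
    time_courses = centered @ sources.T / n_mut
    return sources, time_courses


freqs, true_abundance, true_barcodes = simulate_frequencies()
sources, time_courses = fast_ica(freqs, n_sources=2)
print(sources.shape, time_courses.shape)
```

In this toy setting, each row of `sources` plays the role of a data-driven "barcode" over mutations, and the matching column of `time_courses` shows when that pattern rises and falls. Real wastewater data would also require choosing the number of sources and handling missing coverage, which this sketch does not attempt.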