Department of Integrative Biology, University of California-Berkeley, 4098 Valley Life Sciences Building, Berkeley, CA 94720, USA.
School of Mathematical Sciences, Peking University, Beijing 100871, China.
Cell Rep Methods. 2022 Oct 24;2(10):100313. doi: 10.1016/j.crmeth.2022.100313. Epub 2022 Sep 20.
Wastewater surveillance has become essential for monitoring the spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The quantification of SARS-CoV-2 RNA in wastewater correlates with the coronavirus disease 2019 (COVID-19) caseload in a community. However, estimating the proportions of different SARS-CoV-2 haplotypes has remained technically difficult. We present a phylogenetic imputation method for improving the SARS-CoV-2 reference database and a method for estimating the relative proportions of SARS-CoV-2 haplotypes from wastewater samples. The phylogenetic imputation method uses the global SARS-CoV-2 phylogeny and imputes based on the maximum of the posterior probability of each nucleotide. We show that the imputation method has error rates comparable to, or lower than, typical sequencing error rates, which substantially improves the reference database and allows for accurate inferences of haplotype composition. Our method for estimating relative proportions of haplotypes uses an initial step to remove unlikely haplotypes and an expectation maximization (EM) algorithm for obtaining maximum likelihood estimates of the proportions of different haplotypes in a sample. Using simulations with a reference database of >3 million SARS-CoV-2 genomes, we show that the estimated proportions reflect the true proportions given sufficiently high sequencing depth.
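The EM step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`read_likelihoods`, `em_haplotype_proportions`), the uniform per-base substitution error rate `eps`, and the representation of a read as a `(start, sequence)` pair aligned to the reference are all assumptions made for illustration, and the initial step that removes unlikely haplotypes is omitted. The sketch treats the sample as a mixture of known haplotypes and iterates between assigning each read fractionally to haplotypes (E-step) and averaging those assignments to update the proportions (M-step).

```python
import numpy as np


def read_likelihoods(reads, haplotypes, eps=0.01):
    """Return lik[i, k] = P(read i | haplotype k).

    Assumed error model (hypothetical): each base is read correctly with
    probability 1 - eps and as each of the 3 other bases with eps / 3.
    Each read is a (start, sequence) pair already aligned to the reference.
    """
    lik = np.empty((len(reads), len(haplotypes)))
    for i, (start, seq) in enumerate(reads):
        for k, hap in enumerate(haplotypes):
            ref = hap[start:start + len(seq)]
            mismatches = sum(a != b for a, b in zip(seq, ref))
            matches = len(seq) - mismatches
            lik[i, k] = (1 - eps) ** matches * (eps / 3) ** mismatches
    return lik


def em_haplotype_proportions(lik, n_iter=500, tol=1e-10):
    """Estimate mixture proportions by EM from a read-by-haplotype likelihood matrix.

    Assumes every read has nonzero likelihood under at least one haplotype.
    """
    lik = np.asarray(lik, dtype=float)
    n_haps = lik.shape[1]
    p = np.full(n_haps, 1.0 / n_haps)  # start from uniform proportions
    for _ in range(n_iter):
        # E-step: posterior probability that each read came from each haplotype
        weighted = lik * p
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: new proportions are the mean responsibilities over reads
        p_new = resp.mean(axis=0)
        converged = np.abs(p_new - p).max() < tol
        p = p_new
        if converged:
            break
    return p


# Toy example: two haplotypes differing at one site, 7 reads supporting the
# first and 3 supporting the second.
haplotypes = ["ACGTACGT", "ACGTTCGT"]
reads = [(3, "TAC")] * 7 + [(3, "TTC")] * 3
proportions = em_haplotype_proportions(read_likelihoods(reads, haplotypes))
```

In this toy case the estimate for the first haplotype lands close to 0.7. With only ten reads the estimate is coarse, which is consistent with the abstract's point that accuracy depends on sufficiently high sequencing depth.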