Jiang Nan, Dewey Colin N, Yin John
Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, USA.
Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA.
medRxiv. 2025 Apr 28:2025.04.15.25325794. doi: 10.1101/2025.04.15.25325794.
Deletions are prevalent in the genomes of SARS-CoV-2 isolates from COVID-19 patients, but their roles in the severity, transmission, and persistence of disease are poorly understood. Millions of COVID-19 swab samples from patients have been sequenced and made available online, offering an unprecedented opportunity to study such deletions. Multiplex PCR-based amplicon sequencing (amplicon-seq) has been the most widely used method for sequencing clinical COVID-19 samples. However, when existing bioinformatics methods are applied to negative control samples sequenced by amplicon-seq, they often yield large numbers of false-positive deletions. We found that these false positives commonly occur in short alignments, at low frequency and depth, and near primer-binding sites used for whole-genome amplification. To address this issue, we developed a filtering strategy and validated it with positive control samples containing a known deletion. Our strategy accurately detected the known deletion and removed more than 99% of false positives. Applied to public COVID-19 swab data, this method revealed that deletions occurring independently of transcription regulatory sequences were about 20-fold less common than previously reported; however, they remain more frequent in symptomatic patients. Our optimized approach should enhance the reliability of SARS-CoV-2 deletion characterization in surveillance studies. Finally, our approach may guide the development of more reliable bioinformatics pipelines for genome sequence analyses of other viruses.
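The abstract names four features of false-positive deletions: short alignments, low frequency, low depth, and proximity to primer-binding sites. The sketch below shows one way such a filter could be expressed in Python. It is not the authors' implementation: the `DeletionCall` record, the function names, the primer coordinates, and every threshold value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# All thresholds below are illustrative assumptions, not values from the study.
MIN_ALIGNMENT_LEN = 100   # minimum aligned length of supporting reads (bases)
MIN_SUPPORT_READS = 10    # minimum number of reads supporting the deletion
MIN_FREQUENCY = 0.01      # minimum fraction of spanning reads with the deletion
PRIMER_MARGIN = 5         # bases around a primer-binding site counted as "near"


@dataclass
class DeletionCall:
    """Hypothetical record for one candidate deletion."""
    start: int            # first deleted reference position (1-based)
    end: int              # last deleted reference position (1-based, inclusive)
    support_reads: int    # reads supporting the deletion
    total_depth: int      # all reads spanning the deletion junction
    alignment_len: int    # aligned length of the supporting reads (bases)


def near_primer(pos: int, primers: List[Tuple[int, int]], margin: int) -> bool:
    """Return True if pos lies within margin bases of any primer interval."""
    return any(p_start - margin <= pos <= p_end + margin
               for p_start, p_end in primers)


def passes_filters(call: DeletionCall, primers: List[Tuple[int, int]]) -> bool:
    """Keep a call only if it shows none of the four false-positive features."""
    if call.total_depth == 0:
        return False
    frequency = call.support_reads / call.total_depth
    return (
        call.alignment_len >= MIN_ALIGNMENT_LEN
        and call.support_reads >= MIN_SUPPORT_READS
        and frequency >= MIN_FREQUENCY
        and not near_primer(call.start, primers, PRIMER_MARGIN)
        and not near_primer(call.end, primers, PRIMER_MARGIN)
    )


# Hypothetical primer-binding intervals (1-based, inclusive) and candidate calls.
primers = [(30, 54), (385, 410)]
calls = [
    DeletionCall(27900, 28100, 500, 1000, 140),  # well supported, far from primers
    DeletionCall(400, 420, 3, 2000, 45),         # short, rare, overlaps a primer
    DeletionCall(412, 500, 500, 1000, 150),      # well supported but near a primer
]
kept = [c for c in calls if passes_filters(c, primers)]
```

In this sketch a call must pass every criterion to be kept, which favors specificity over sensitivity. Whether the criteria should be combined this strictly, and at what cutoffs, would need to be tuned against negative and positive control samples as the abstract describes.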