Public Health Agency of Canada, National Microbiology Laboratory, Winnipeg, Canada.
Cadham Provincial Laboratory, Winnipeg, Canada.
PLoS Comput Biol. 2024 Aug 19;20(8):e1011539. doi: 10.1371/journal.pcbi.1011539. eCollection 2024 Aug.
The SARS-CoV-2 pandemic has brought molecular biology and genomic sequencing into the public consciousness and lexicon. With an emphasis on rapid turnaround, genomic data informed both diagnostic and surveillance decisions for the current pandemic at a previously unheard-of scale. The surge in submissions of genomic data to publicly available databases proved essential, because comparing genome sequences offers a wealth of knowledge, including phylogenetic links, modes of transmission, rates of evolution, and the impact of mutations on infection and disease severity. However, the scale of the pandemic has meant that sequencing runs are rarely repeated, owing to limited sample material and/or limited availability of sequencing resources, and some imperfect runs have consequently been uploaded to public repositories. It is therefore crucial to investigate the data from these imperfect runs and determine whether the results are reliable before depositing them in a public database. As the number of next-generation sequencing (NGS) studies and the diversity of sequencing technologies and procedures have grown, numerous studies have identified a variety of sources of contamination in public NGS data. For this study, we conducted an in silico experiment with known SARS-CoV-2 sequences produced by Oxford Nanopore Technologies sequencing to investigate the effect of contamination on lineage calls and single nucleotide variants (SNVs). We identified a contamination threshold below which runs are expected to generate accurate lineage calls and maintain genome relatedness and integrity. Together, these findings provide a benchmark below which imperfect runs may be considered robust for reporting results to both stakeholders and public repositories, reducing the need for repeat or wasted runs.
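The in silico contamination design described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `mix_reads`, its parameters, and the representation of reads as plain strings are assumptions made for illustration only. The sketch draws a fixed number of reads from a "target" sample and a "contaminant" sample so that the contaminant makes up a chosen fraction of the mixture. In practice the reads would come from FASTQ files, and the mixed set would then be passed to consensus generation, lineage calling, and SNV calling, none of which is shown here.

```python
import random


def mix_reads(target_reads, contaminant_reads, fraction, total, seed=0):
    """Build a mixed read set of size `total` (hypothetical helper).

    `fraction` is the share of reads taken from the contaminant sample;
    the remainder comes from the target sample. Reads are sampled without
    replacement, so each input must contain enough reads.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")

    n_contaminant = round(fraction * total)
    n_target = total - n_contaminant
    if n_target > len(target_reads) or n_contaminant > len(contaminant_reads):
        raise ValueError("not enough reads to build the requested mixture")

    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    mixed = rng.sample(target_reads, n_target)
    mixed += rng.sample(contaminant_reads, n_contaminant)
    rng.shuffle(mixed)  # avoid ordering reads by their source
    return mixed


# Toy example: labels stand in for real read sequences.
target = [f"target_read_{i}" for i in range(1000)]
contaminant = [f"contaminant_read_{i}" for i in range(1000)]
mixture = mix_reads(target, contaminant, fraction=0.1, total=200, seed=42)
```

Repeating this over a range of `fraction` values and comparing the lineage call and SNVs of each mixture against those of the uncontaminated target is one way to look for the kind of contamination threshold the abstract refers to.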