Clinical Bioinformatics Area, Fundación Progreso y Salud (FPS), CDCA, Hospital Virgen del Rocio, 41013 Sevilla, Spain.
Computational Systems Medicine, Institute of Biomedicine of Seville (IBIS), Hospital Virgen del Rocio, 41013 Sevilla, Spain.
GigaScience. 2021 Dec 2;10(12). doi: 10.1093/gigascience/giab078.
The current SARS-CoV-2 pandemic has emphasized the utility of viral whole-genome sequencing in the surveillance and control of the pathogen. An unprecedented ongoing global initiative is producing hundreds of thousands of sequences worldwide. However, the complex circumstances in which viruses are sequenced, along with the demand for urgent results, lead to a high rate of incomplete and, therefore, unusable sequences. Viral sequences evolve in the context of a complex phylogeny, and different positions along the genome are in linkage disequilibrium. An imputation method should therefore be able to predict missing positions from the available sequencing data.
We have developed the impuSARS application, which takes advantage of the enormous number of SARS-CoV-2 genomes available, using a reference panel containing 239,301 sequences, to impute missing data in viral genomes. ImpuSARS was tested under a wide range of conditions (missing continuous fragments, missing amplicons, or sparse individual missing positions) and showed great fidelity when reconstructing the original sequences, recovering the lineage with 100% precision for almost all lineages, even in very poorly covered genomes (<20%).
Imputation can improve the pace of SARS-CoV-2 sequencing production by recovering many incomplete or low-quality sequences that would otherwise be discarded. ImpuSARS can be incorporated into any primary data processing pipeline for SARS-CoV-2 whole-genome sequencing.
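The idea of imputing missing positions from a reference panel can be illustrated with a toy sketch. This is not the impuSARS implementation: it assumes pre-aligned, equal-length sequences, uses "N" for missing bases, and simply copies missing positions from the panel sequence that agrees best with the observed ones. The function names `mask_region` and `impute` and the three-sequence panel are hypothetical.

```python
from typing import List

MISSING = "N"  # assumed symbol for an uncalled base


def mask_region(seq: str, start: int, end: int) -> str:
    """Simulate a missing continuous fragment by masking seq[start:end]."""
    return seq[:start] + MISSING * (end - start) + seq[end:]


def impute(query: str, panel: List[str]) -> str:
    """Fill missing positions in `query` from the closest panel sequence.

    Closeness is the number of mismatches at positions observed in the
    query. This nearest-sequence rule is a simplification chosen for
    illustration, not the statistical model a real imputation tool uses.
    """
    if not panel:
        raise ValueError("reference panel is empty")
    if any(len(ref) != len(query) for ref in panel):
        raise ValueError("all sequences must be aligned to the same length")

    def mismatches(ref: str) -> int:
        # Count disagreements only where the query has an observed base.
        return sum(1 for q, r in zip(query, ref) if q != MISSING and q != r)

    best = min(panel, key=mismatches)
    # Keep observed bases; copy the best reference's base where missing.
    return "".join(r if q == MISSING else q for q, r in zip(query, best))


if __name__ == "__main__":
    # Hypothetical toy panel; real panels hold full-length genomes.
    panel = ["ACGTACGT", "ACGAACCT", "TCGTACGA"]
    original = "ACGAACCT"
    masked = mask_region(original, 2, 6)
    print(masked, "->", impute(masked, panel))
```

Because linked positions tend to be inherited together, the observed bases are often enough to identify the closest known sequence, which is what makes the missing bases predictable. Masking a known sequence and comparing the imputed result with the original mirrors, in miniature, the kind of evaluation described above.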