Beijing Institute of Genomics, Chinese Academy of Sciences, and China National Center for Bioinformation, Beijing 100101, China.
University of Chinese Academy of Sciences, Beijing 100049, China.
Brief Bioinform. 2023 Sep 22;24(6). doi: 10.1093/bib/bbad339.
To contain infectious diseases, it is crucial to determine the origin and transmission routes of the pathogen, as well as how it evolves. With the development of genome sequencing technology, genomic epidemiology has emerged as a powerful approach for investigating the source and transmission of pathogens. In this study, we first present the rationale for genomic tracing of SARS-CoV-2 and the challenges currently faced. Identifying the reference sequence most genetically similar to the query sequence is a critical step in genomic tracing, typically achieved using either a phylogenetic tree or a sequence similarity search. However, these methods become inefficient or computationally prohibitive when the reference database contains tens of millions of sequences, as was the case during the COVID-19 pandemic. To address this challenge, we developed a novel genomic tracing algorithm capable of processing 6 million SARS-CoV-2 sequences in less than a minute. Instead of constructing a giant phylogenetic tree, we devised a weighted scoring system based on mutation characteristics to quantify sequence similarity. The method outperformed previous methods. Additionally, we developed an online platform to facilitate genomic tracing and visualization of the spatiotemporal distribution of sequences. The method will be a valuable addition to standard epidemiological investigations, enabling more efficient genomic tracing. Furthermore, the computational framework can be easily adapted to other pathogens, paving the way for routine genomic tracing of infectious diseases.
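The abstract does not spell out the weighted scoring system, so the following is only a minimal sketch of the general idea of ranking reference sequences by weighted mutation overlap rather than by building a tree. Everything specific here is an assumption made for illustration: sequences are represented as sets of mutations relative to a common reference genome, rarer mutations receive larger weights through an inverse-frequency term, unshared mutations are penalized, and the names `mutation_weights`, `score`, `trace`, `mismatch_penalty`, and the `ref_*` identifiers are hypothetical. It should not be read as the authors' published algorithm.

```python
from collections import Counter
from math import log

def mutation_weights(reference_db):
    """Weight each mutation by rarity in the database (assumed scheme).

    A mutation carried by every sequence gets weight 1.0; rarer mutations
    get larger weights because they are more informative for tracing.
    """
    n = len(reference_db)
    counts = Counter(m for muts in reference_db.values() for m in muts)
    return {m: log(n / c) + 1.0 for m, c in counts.items()}

def score(query, candidate, weights, default_weight, mismatch_penalty=0.5):
    """Reward shared mutations and penalize mutations found in only one sequence."""
    shared = query & candidate
    unshared = query ^ candidate
    gain = sum(weights.get(m, default_weight) for m in shared)
    loss = sum(weights.get(m, default_weight) for m in unshared)
    return gain - mismatch_penalty * loss

def trace(query, reference_db, top_k=3):
    """Return the IDs of the top_k reference sequences most similar to the query."""
    weights = mutation_weights(reference_db)
    # A mutation absent from the database is treated as maximally rare.
    default_weight = log(len(reference_db)) + 1.0
    ranked = sorted(
        reference_db,
        key=lambda sid: score(query, reference_db[sid], weights, default_weight),
        reverse=True,
    )
    return ranked[:top_k]

# Toy data: illustrative mutation sets, not real database records.
reference_db = {
    "ref_A": {"C241T", "C3037T", "A23403G"},
    "ref_B": {"C241T", "C3037T", "A23403G", "G28881A"},
    "ref_C": {"C241T", "T19839C"},
}
query = {"C241T", "C3037T", "A23403G", "G28881A"}
best_matches = trace(query, reference_db)
```

In this sketch each comparison costs time proportional to the number of mutations in the two sequences rather than to genome length. A linear scan of this kind scales with database size, so reaching millions of sequences in under a minute would presumably require indexing or other optimizations that the abstract does not describe.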