Nuffield Department of Medicine, University of Oxford, Oxford, UK.
Present address: UKRI Science and Technologies Facilities Council, Harwell, UK.
Microb Genom. 2022 Jun;8(6). doi: 10.1099/mgen.0.000850.
There is a need to identify microbial sequences that may form part of transmission chains, or that may represent importations across national boundaries, amidst large numbers of SARS-CoV-2 and other bacterial or viral sequences. Reference-based compression is a sequence analysis technique that allows both compact storage of sequence data and comparisons between sequences. Published implementations of the approach are being challenged by the large sample collections now being generated. Our aim was to develop fast software for detecting highly similar sequences in large collections of microbial genomes, including millions of SARS-CoV-2 genomes. To do so, we developed Catwalk, a tool that bypasses bottlenecks in the generation, comparison and in-memory storage of microbial genomes generated by reference mapping. It is a compiled solution, coded in Nim to increase performance. It can be accessed via command-line, REST API or web server interfaces. We tested Catwalk using both SARS-CoV-2 and genomes generated by prospective public-health sequencing programmes. Pairwise sequence comparisons, using clinically relevant similarity cut-offs, took about 0.39 and 0.66 μs, respectively; in 1 s, between 1 and 2 million sequences can be searched. Catwalk operates about 1700 times faster than, and uses about 8 % of the RAM of, a Python reference-based compression and comparison tool in current use for outbreak detection. Catwalk can rapidly identify close relatives of a SARS-CoV-2 or genome amidst millions of samples.
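The general idea of reference-based compression and comparison can be illustrated with a minimal Python sketch. This is not Catwalk's implementation, which is written in Nim and whose internals are not described here; the function names (`compress`, `snp_distance`, `neighbours`), the treatment of any non-ACGT symbol as unknown, and the choice to exclude unknown positions from the distance are all assumptions made for illustration. Each sequence is stored only as the set of positions at which it differs from the reference, and two sequences are compared using set operations on those positions.

```python
from typing import Dict, List, Set

BASES = "ACGT"

def compress(seq: str, ref: str) -> Dict[str, Set[int]]:
    """Keep only the positions where seq differs from ref, grouped by called symbol.

    Any symbol outside ACGT is recorded under "N" (assumption: treated as unknown).
    """
    if len(seq) != len(ref):
        raise ValueError("sequence and reference must have the same length")
    diffs: Dict[str, Set[int]] = {symbol: set() for symbol in BASES + "N"}
    for pos, (s, r) in enumerate(zip(seq, ref)):
        if s == r:
            continue  # identical to the reference: nothing to store
        diffs[s if s in BASES else "N"].add(pos)
    return diffs

def snp_distance(a: Dict[str, Set[int]], b: Dict[str, Set[int]]) -> int:
    """Count positions at which two compressed sequences differ.

    Positions unknown in either sequence are ignored (an illustrative choice).
    """
    differing: Set[int] = set()
    for base in BASES:
        # A position present in only one of the two sets for this base
        # means the two sequences disagree at that position.
        differing |= a[base] ^ b[base]
    differing -= a["N"] | b["N"]
    return len(differing)

def neighbours(query: str, store: Dict[str, Dict[str, Set[int]]],
               ref: str, cutoff: int) -> List[str]:
    """Return names of stored samples within `cutoff` differences of the query."""
    q = compress(query, ref)
    return sorted(name for name, c in store.items() if snp_distance(q, c) <= cutoff)

if __name__ == "__main__":
    ref = "ACGTACGTAC"
    samples = {
        "s1": "ACGTACGTAC",  # identical to the reference
        "s2": "ACGAACGTAC",  # one difference from the reference
        "s3": "ACGAACGTTC",  # two differences from the reference
        "s4": "ANGCACGTAC",  # one unknown call and one difference
    }
    store = {name: compress(seq, ref) for name, seq in samples.items()}
    print(snp_distance(store["s1"], store["s3"]))
    print(neighbours("ACGAACGTAC", store, ref, cutoff=1))
```

Because most positions in closely related genomes match the reference, the stored sets stay small, which is what makes both storage and comparison cheap in this kind of scheme.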