Max Planck Institute for Molecular Cell Biology and Genetics (MPI-CBG and CSBD), Dresden 01307, Germany.
Blue River Technology, Sunnyvale, CA 94086, USA.
Bioinformatics. 2022 Mar 28;38(7):1838-1845. doi: 10.1093/bioinformatics/btac064.
Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences.
We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state-of-the-art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. And we provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics.
Our C/C++ code is available at https://github.com/odenas/indexed_ms under GPL-3.0. The data underlying this article are available in NCBI Genome at https://www.ncbi.nlm.nih.gov/genome and in the International Genome Sample Resource (IGSR) at https://www.internationalgenome.org.
Supplementary data are available at Bioinformatics online.
在长读长、泛基因组计划时代,需要快速、轻量级的方法来比较来自不断增长的数据库的不断增大的组装基因组序列。匹配统计是计算全基因组系统发育和检测两个基因组之间结构重排的一种流行方法,因为它易于实现,需要最小的数据结构设置。然而,当前的实现使用单个核心,占用太多的内存来表示结果,并且没有提供有效的方法来分析输出,以探索序列之间的局部相似性。
我们开发了用于计算大规模字符串之间匹配统计的实用工具,并设计了一种用于分析其值的方法,与最先进的方法相比,速度更快,内存消耗更少。具体来说,我们为共享内存机器设计了一种并行算法,在最难以并行化的情况下,使用 48 个核可将匹配统计计算速度提高 30 倍。我们设计了一种有损压缩方案,将匹配统计数组缩小为位向量,根据数据集和阈值的值,每个字符占用 0.8 到 0.2 位,在某些变体中,每个字符占用 0.04 位。并且我们提供了范围最大值和范围总和查询的有效实现,这些查询在操作我们的紧凑表示时只需几十毫秒,并且允许计算两个字符串之间相似性的关键局部统计信息。我们的工具包使得构建、存储和分析匹配统计数组对于当今可用的最大基因组的多对成为可能,可能为比较基因组学中的新应用提供支持。
我们的 C/C++ 代码可在 https://github.com/odenas/indexed_ms 下获得,遵循 GPL-3.0 许可。本文所依据的数据可在 NCBI 基因组中获得 https://www.ncbi.nlm.nih.gov/genome 和在国际基因组样本资源(IGSR)中获得 https://www.internationalgenome.org。
补充数据可在生物信息学在线获得。