IEEE/ACM Trans Comput Biol Bioinform. 2019 Jul-Aug;16(4):1117-1131. doi: 10.1109/TCBB.2017.2760829. Epub 2017 Oct 9.
Counting and indexing fixed-length substrings, or $k$-mers, in biological sequences is a key step in many bioinformatics tasks including genome alignment and mapping, genome assembly, and error correction. While advances in next-generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, few bioinformatics tools can efficiently process the datasets at the current generation rate of 1.8 terabases per 3-day experiment from a single sequencer. We present Kmerind, a high-performance parallel $k$-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Kmerind's $k$-mer counter performs similarly to or better than the best existing $k$-mer counting tools even on shared memory systems. In a distributed memory environment, Kmerind counts $k$-mers in a 120 GB sequence read dataset in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1 percent of the $k$-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. Kmerind is the first $k$-mer indexing library for distributed memory environments, and the first extensible library for general $k$-mer indexing and counting. Kmerind is available at https://github.com/ParBLiSS/kmerind.
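For readers unfamiliar with the two operations the abstract names, the following is a minimal single-process sketch of $k$-mer counting and $k$-mer position indexing. It is not Kmerind's API or algorithm, and it does not attempt the distributed-memory parallelism the abstract describes. The function names (`iter_kmers`, `count_kmers`, `index_kmer_positions`), the sample reads, and the choice to skip windows containing non-ACGT characters are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

VALID_BASES = set("ACGT")


def iter_kmers(seq, k):
    """Yield (position, k-mer) for each length-k window made only of A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption for this sketch: windows with ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= VALID_BASES:
            yield i, kmer


def count_kmers(reads, k):
    """Count index: map each k-mer to its number of occurrences across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmer for _, kmer in iter_kmers(read, k))
    return counts


def index_kmer_positions(reads, k):
    """Position index: map each k-mer to a list of (read_id, offset) occurrences."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for pos, kmer in iter_kmers(read, k):
            index[kmer].append((read_id, pos))
    return index


# Hypothetical toy input; real read sets are many orders of magnitude larger.
reads = ["ACGTACGT", "CGTAN"]
counts = count_kmers(reads, 3)
positions = index_kmer_positions(reads, 3)

print(counts["CGT"])     # occurrences of CGT across both reads
print(positions["CGT"])  # (read_id, offset) pairs where CGT occurs
```

A query in this sketch is a dictionary lookup, such as `counts["CGT"]` or `positions["CGT"]`. The position index stores one entry per occurrence rather than one integer per distinct $k$-mer, so it holds considerably more data than the count index.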