一个用于核苷酸和氨基酸搜索的优化FM索引库。

An optimized FM-index library for nucleotide and amino acid search.

作者信息

Anderson Tim, Wheeler Travis J

机构信息

Department of Computer Science, University of Montana, Missoula, MT, USA.

出版信息

Algorithms Mol Biol. 2021 Dec 31;16(1):25. doi: 10.1186/s13015-021-00204-6.

DOI:10.1186/s13015-021-00204-6

PMID:34972525

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC8719400/

Abstract

BACKGROUND

Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library.

RESULTS

We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index's suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3's FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is [Formula: see text]2-4x faster than SeqAn3 for nucleotide search, and [Formula: see text]2-6x faster for amino acid search; it is also [Formula: see text]4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage.

CONCLUSIONS

AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.

摘要

背景

模式匹配是各种生物序列分析流程中的关键步骤。FM索引是一种用于模式匹配的压缩数据结构，其搜索运行时间与数据库文本的长度无关。FM索引的实现相当复杂，因此，一个快速且灵活的FM索引库将有助于其更广泛地被采用。

结果

我们展示了AvxWindowedFMindex（AWFM索引），这是一个用C语言编写的轻量级、开源、线程并行的FM索引库，针对核苷酸和氨基酸序列索引进行了优化。AWFM索引引入了一种新方法，以跨步位向量格式存储FM索引数据，通过AVX2按位指令能够极其高效地计算FM索引出现函数，并将其与索引后缀数组的可选磁盘存储以及用于部分k-mer搜索的缓存高效查找表相结合。在一系列可比内存占用情况下，AWFM索引执行精确匹配计数和定位查询的速度比SeqAn3的FM索引实现更快。在针对速度进行优化时，对于核苷酸搜索，AWFM索引比SeqAn3快2至4倍，对于氨基酸搜索快2至6倍；当将后缀数组存储在磁盘固态硬盘中时，在类似内存占用情况下，它也快4倍。