Suppr超能文献

一个用于核苷酸和氨基酸搜索的优化FM索引库。

An optimized FM-index library for nucleotide and amino acid search.

作者信息

Anderson Tim, Wheeler Travis J

机构信息

Department of Computer Science, University of Montana, Missoula, MT, USA.

出版信息

Algorithms Mol Biol. 2021 Dec 31;16(1):25. doi: 10.1186/s13015-021-00204-6.

Abstract

BACKGROUND

Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library.

RESULTS

We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index's suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3's FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is [Formula: see text]2-4x faster than SeqAn3 for nucleotide search, and [Formula: see text]2-6x faster for amino acid search; it is also [Formula: see text]4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage.

CONCLUSIONS

AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.

摘要

背景

模式匹配是各种生物序列分析流程中的关键步骤。FM索引是一种用于模式匹配的压缩数据结构,其搜索运行时间与数据库文本的长度无关。FM索引的实现相当复杂,因此,一个快速且灵活的FM索引库将有助于其更广泛地被采用。

结果

我们展示了AvxWindowedFMindex(AWFM索引),这是一个用C语言编写的轻量级、开源、线程并行的FM索引库,针对核苷酸和氨基酸序列索引进行了优化。AWFM索引引入了一种新方法,以跨步位向量格式存储FM索引数据,通过AVX2按位指令能够极其高效地计算FM索引出现函数,并将其与索引后缀数组的可选磁盘存储以及用于部分k-mer搜索的缓存高效查找表相结合。在一系列可比内存占用情况下,AWFM索引执行精确匹配计数和定位查询的速度比SeqAn3的FM索引实现更快。在针对速度进行优化时,对于核苷酸搜索,AWFM索引比SeqAn3快2至4倍,对于氨基酸搜索快2至6倍;当将后缀数组存储在磁盘固态硬盘中时,在类似内存占用情况下,它也快4倍。

结论

AWFM索引易于整合到生物信息学软件中,提供运行时性能参数化,并在高级别(计数或定位查询字符串的所有实例)和低级别(FM索引反向搜索过程的逐步控制)为客户端提供FM索引功能。该开源库可在https://github.com/TravisWheelerLab/AvxWindowFmIndex下载。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/c5e5f49feee5/13015_2021_204_Fig1_HTML.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验