• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

一个用于核苷酸和氨基酸搜索的优化FM索引库。

An optimized FM-index library for nucleotide and amino acid search.

作者信息

Anderson Tim, Wheeler Travis J

机构信息

Department of Computer Science, University of Montana, Missoula, MT, USA.

出版信息

Algorithms Mol Biol. 2021 Dec 31;16(1):25. doi: 10.1186/s13015-021-00204-6.

DOI:10.1186/s13015-021-00204-6
PMID:34972525
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC8719400/
Abstract

BACKGROUND

Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library.

RESULTS

We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index's suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3's FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is [Formula: see text]2-4x faster than SeqAn3 for nucleotide search, and [Formula: see text]2-6x faster for amino acid search; it is also [Formula: see text]4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage.

CONCLUSIONS

AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.

摘要

背景

模式匹配是各种生物序列分析流程中的关键步骤。FM索引是一种用于模式匹配的压缩数据结构,其搜索运行时间与数据库文本的长度无关。FM索引的实现相当复杂,因此,一个快速且灵活的FM索引库将有助于其更广泛地被采用。

结果

我们展示了AvxWindowedFMindex(AWFM索引),这是一个用C语言编写的轻量级、开源、线程并行的FM索引库,针对核苷酸和氨基酸序列索引进行了优化。AWFM索引引入了一种新方法,以跨步位向量格式存储FM索引数据,通过AVX2按位指令能够极其高效地计算FM索引出现函数,并将其与索引后缀数组的可选磁盘存储以及用于部分k-mer搜索的缓存高效查找表相结合。在一系列可比内存占用情况下,AWFM索引执行精确匹配计数和定位查询的速度比SeqAn3的FM索引实现更快。在针对速度进行优化时,对于核苷酸搜索,AWFM索引比SeqAn3快2至4倍,对于氨基酸搜索快2至6倍;当将后缀数组存储在磁盘固态硬盘中时,在类似内存占用情况下,它也快4倍。

结论

AWFM索引易于整合到生物信息学软件中,提供运行时性能参数化,并在高级别(计数或定位查询字符串的所有实例)和低级别(FM索引反向搜索过程的逐步控制)为客户端提供FM索引功能。该开源库可在https://github.com/TravisWheelerLab/AvxWindowFmIndex下载。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/145b67189548/13015_2021_204_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/c5e5f49feee5/13015_2021_204_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/cfdd9eb5c4d9/13015_2021_204_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/ad561cefea57/13015_2021_204_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/bed1b8b1d98c/13015_2021_204_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/145b67189548/13015_2021_204_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/c5e5f49feee5/13015_2021_204_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/cfdd9eb5c4d9/13015_2021_204_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/ad561cefea57/13015_2021_204_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/bed1b8b1d98c/13015_2021_204_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/db75/8719400/145b67189548/13015_2021_204_Fig5_HTML.jpg

相似文献

1
An optimized FM-index library for nucleotide and amino acid search.一个用于核苷酸和氨基酸搜索的优化FM索引库。
Algorithms Mol Biol. 2021 Dec 31;16(1):25. doi: 10.1186/s13015-021-00204-6.
2
b-move: faster bidirectional character extensions in a run-length compressed index.b移动:游程长度压缩索引中更快的双向字符扩展
bioRxiv. 2024 Jun 2:2024.05.30.596587. doi: 10.1101/2024.05.30.596587.
3
A space and time-efficient index for the compacted colored de Bruijn graph.一种用于压缩彩色 de Bruijn 图的空间和时间高效索引。
Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.
4
Efficient privacy-preserving variable-length substring match for genome sequence.用于基因组序列的高效隐私保护可变长度子串匹配
Algorithms Mol Biol. 2022 Apr 26;17(1):9. doi: 10.1186/s13015-022-00211-1.
5
Efficient Construction and Utilization of k-Ordered FM-indexes with kISS for Ultra-Fast Read Mapping in Large Genomes.利用kISS高效构建和使用k阶FM索引以实现大型基因组中的超快速读段映射
Bioinformatics. 2024 Jun 19;40(7). doi: 10.1093/bioinformatics/btae409.
6
E2FM: an encrypted and compressed full-text index for collections of genomic sequences.E2FM:用于基因组序列集合的加密和压缩全文索引。
Bioinformatics. 2017 Sep 15;33(18):2808-2817. doi: 10.1093/bioinformatics/btx313.
7
PFP-FM: An Accelerated FM-index.PFP-FM:一种加速的FM索引。
Res Sq. 2023 Oct 30:rs.3.rs-3487536. doi: 10.21203/rs.3.rs-3487536/v1.
8
Pfp-fm: an accelerated FM-index.Pfp-fm:一种加速的FM索引。
Algorithms Mol Biol. 2024 Apr 10;19(1):15. doi: 10.1186/s13015-024-00260-8.
9
PTPan--overcoming memory limitations in oligonucleotide string matching for primer/probe design.PTPan--克服引物/探针设计中寡核苷酸序列匹配的记忆限制。
Bioinformatics. 2011 Oct 15;27(20):2797-805. doi: 10.1093/bioinformatics/btr483. Epub 2011 Aug 19.
10
Breaking the -Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees.突破压缩后缀数组和后缀树构建中的-障碍
Proc Annu ACM SIAM Symp Discret Algorithms. 2023;2023:5122-5202. doi: 10.1137/1.9781611977554.ch187.

引用本文的文献

1
Movi: A fast and cache-efficient full-text pangenome index.Movi:一种快速且缓存高效的全基因组索引。
iScience. 2024 Nov 27;27(12):111464. doi: 10.1016/j.isci.2024.111464. eCollection 2024 Dec 20.
2
SigAlign: an alignment algorithm guided by explicit similarity criteria.SigAlign:一种基于显式相似性标准的对齐算法。
Nucleic Acids Res. 2024 Aug 27;52(15):8717-8733. doi: 10.1093/nar/gkae607.
3
nail: software for high-speed, high-sensitivity protein sequence annotation.NAIL:用于高速、高灵敏度蛋白质序列注释的软件。

本文引用的文献

1
UniProt: the universal protein knowledgebase in 2021.UniProt:2021 年的通用蛋白质知识库。
Nucleic Acids Res. 2021 Jan 8;49(D1):D480-D489. doi: 10.1093/nar/gkaa1100.
2
An efficient error correction algorithm using FM-index.一种使用FM索引的高效错误校正算法。
BMC Bioinformatics. 2017 Nov 28;18(1):524. doi: 10.1186/s12859-017-1940-1.
3
MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.MMseqs2支持进行灵敏的蛋白质序列搜索,以分析海量数据集。
bioRxiv. 2024 Jan 30:2024.01.27.577580. doi: 10.1101/2024.01.27.577580.
4
From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures.从分子到基因组变异:通过智能算法和架构加速基因组分析
Comput Struct Biotechnol J. 2022 Aug 18;20:4579-4599. doi: 10.1016/j.csbj.2022.08.019. eCollection 2022.
Nat Biotechnol. 2017 Nov;35(11):1026-1028. doi: 10.1038/nbt.3988. Epub 2017 Oct 16.
4
FMtree: a fast locating algorithm of FM-indexes for genomic data.FMtree:一种用于基因组数据的 FM-indexes 的快速定位算法。
Bioinformatics. 2018 Feb 1;34(3):416-424. doi: 10.1093/bioinformatics/btx596.
5
The SeqAn C++ template library for efficient sequence analysis: A resource for programmers.SeqAn C++ 模板库用于高效的序列分析:面向程序员的资源。
J Biotechnol. 2017 Nov 10;261:157-168. doi: 10.1016/j.jbiotec.2017.07.017. Epub 2017 Sep 6.
6
Centrifuge: rapid and sensitive classification of metagenomic sequences.离心机:宏基因组序列的快速灵敏分类
Genome Res. 2016 Dec;26(12):1721-1729. doi: 10.1101/gr.210641.116. Epub 2016 Oct 17.
7
Fast and sensitive taxonomic classification for metagenomics with Kaiju.使用Kaiju对宏基因组学进行快速且灵敏的分类学分类。
Nat Commun. 2016 Apr 13;7:11257. doi: 10.1038/ncomms11257.
8
Fast and sensitive protein alignment using DIAMOND.使用 DIAMOND 进行快速灵敏的蛋白质比对。
Nat Methods. 2015 Jan;12(1):59-60. doi: 10.1038/nmeth.3176. Epub 2014 Nov 17.
9
Efficient construction of an assembly string graph using the FM-index.利用 FM 索引高效构建组装字符串图。
Bioinformatics. 2010 Jun 15;26(12):i367-73. doi: 10.1093/bioinformatics/btq217.
10
Fast and accurate short read alignment with Burrows-Wheeler transform.使用Burrows-Wheeler变换进行快速准确的短读比对。
Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18.