• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

PFP-FM:一种加速的FM索引。

PFP-FM: An Accelerated FM-index.

作者信息

Hong Aaron, Oliva Marco, Köppl Dominik, Bannai Hideo, Boucher Christina, Gagie Travis

机构信息

Department of Computer and Information Science and Engineering University of Florida, Gainesville, 32611, Florida, USA.

Faculty of Engineering, University of Yamanashi, Kōfu, 400-8510, Japan.

出版信息

Res Sq. 2023 Oct 30:rs.3.rs-3487536. doi: 10.21203/rs.3.rs-3487536/v1.

DOI:10.21203/rs.3.rs-3487536/v1
PMID:37961504
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC10635359/
Abstract

FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing-which takes parameters that let us tune the average length of the phrases-instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a minor increase in the memory. The source code for PFP-FM is available at https://github.com/marco-oliva/afm.

摘要

FM索引是DNA比对中的一种关键数据结构,但使用它们进行搜索通常在查询模式中每个字符至少需要一次随机访问。费拉吉纳和菲舍尔[1]在2007年观察到,基于单词的索引通常比基于字符的索引使用更少的随机访问,因此支持更快的搜索。然而,由于DNA缺乏自然的单词边界,在应用基于单词的FM索引之前,有必要以某种方式对其进行解析。去年,邓等人[2]提出通过诱导后缀排序来解析基因组数据,并表明当模式为几千个字符或更长时,由此产生的基于单词的FM索引比标准FM索引支持更快的计数查询。在本文中,我们表明使用无前缀解析(它采用参数,使我们能够调整短语的平均长度)而不是诱导后缀排序,对于只有几百个字符的模式能显著加快速度。我们实现了我们的方法,并证明在对GRCh38的查询中,它比竞争方法快3到18倍,并且在对25000、50000和100000个SARS-CoV-2基因组的查询中始终更快。因此,我们的方法似乎在内存略有增加的情况下,加速了所有现有技术方法的计数性能。PFP-FM的源代码可在https://github.com/marco-oliva/afm获取。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f170/10635359/1a054e5aff40/nihpp-rs3487536v1-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f170/10635359/8ff54ff0f915/nihpp-rs3487536v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f170/10635359/14fff3c0cdc7/nihpp-rs3487536v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f170/10635359/a4c9c21e00d0/nihpp-rs3487536v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f170/10635359/3c3945d95ac0/nihpp-rs3487536v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f170/10635359/1a054e5aff40/nihpp-rs3487536v1-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f170/10635359/8ff54ff0f915/nihpp-rs3487536v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f170/10635359/14fff3c0cdc7/nihpp-rs3487536v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f170/10635359/a4c9c21e00d0/nihpp-rs3487536v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f170/10635359/3c3945d95ac0/nihpp-rs3487536v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f170/10635359/1a054e5aff40/nihpp-rs3487536v1-f0005.jpg

相似文献

1
PFP-FM: An Accelerated FM-index.PFP-FM:一种加速的FM索引。
Res Sq. 2023 Oct 30:rs.3.rs-3487536. doi: 10.21203/rs.3.rs-3487536/v1.
2
Pfp-fm: an accelerated FM-index.Pfp-fm:一种加速的FM索引。
Algorithms Mol Biol. 2024 Apr 10;19(1):15. doi: 10.1186/s13015-024-00260-8.
3
Efficient Construction and Utilization of k-Ordered FM-indexes with kISS for Ultra-Fast Read Mapping in Large Genomes.利用kISS高效构建和使用k阶FM索引以实现大型基因组中的超快速读段映射
Bioinformatics. 2024 Jun 19;40(7). doi: 10.1093/bioinformatics/btae409.
4
Recursive Prefix-Free Parsing for Building Big BWTs.用于构建大型Burrows-Wheeler变换的递归无前缀解析
bioRxiv. 2023 Jan 20:2023.01.18.524557. doi: 10.1101/2023.01.18.524557.
5
Recursive Prefix-Free Parsing for Building Big BWTs.用于构建大型Burrows-Wheeler变换的递归无前缀解析
Proc Data Compress Conf. 2023 Mar;2023:62-70. Epub 2023 May 19.
6
Building a pangenome alignment index via recursive prefix-free parsing.通过递归无前缀解析构建泛基因组比对索引。
iScience. 2024 Sep 12;27(10):110933. doi: 10.1016/j.isci.2024.110933. eCollection 2024 Oct 18.
7
PFP Compressed Suffix Trees.PFP压缩后缀树
Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.
8
Computing the original eBWT faster, simpler, and with less memory.更快、更简单且占用更少内存地计算原始增强型Burrows-Wheeler变换。
Int Symp String Process Inf Retr. 2021 Oct;12944:129-142. doi: 10.1007/978-3-030-86692-1_11. Epub 2021 Sep 27.
9
FMtree: a fast locating algorithm of FM-indexes for genomic data.FMtree:一种用于基因组数据的 FM-indexes 的快速定位算法。
Bioinformatics. 2018 Feb 1;34(3):416-424. doi: 10.1093/bioinformatics/btx596.
10
An optimized FM-index library for nucleotide and amino acid search.一个用于核苷酸和氨基酸搜索的优化FM索引库。
Algorithms Mol Biol. 2021 Dec 31;16(1):25. doi: 10.1186/s13015-021-00204-6.

本文引用的文献

1
Sustainable data analysis with Snakemake.使用 Snakemake 进行可持续数据分析。
F1000Res. 2021 Jan 18;10:33. doi: 10.12688/f1000research.29032.2. eCollection 2021.
2
Matching Reads to Many Genomes with the -Index.使用 -索引将 reads 与多个基因组进行匹配。
J Comput Biol. 2020 Apr;27(4):514-518. doi: 10.1089/cmb.2019.0316. Epub 2020 Mar 16.
3
Prefix-free parsing for building big BWTs.用于构建大型Burrows-Wheeler变换(BWT)的无前缀解析
Algorithms Mol Biol. 2019 May 24;14:13. doi: 10.1186/s13015-019-0148-5. eCollection 2019.
4
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.短DNA序列与人类基因组的超快速且内存高效比对。
Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25. Epub 2009 Mar 4.