Building a pangenome alignment index via recursive prefix-free parsing.

Authors

Ferro Eddie, Oliva Marco, Gagie Travis, Boucher Christina

Affiliations

Department of Computer and Information Science and Engineering, Herbert-Wertheim College of Engineering, University of Florida, Gainesville, FL 32607, USA.

Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada.

Publication information

iScience. 2024 Sep 12;27(10):110933. doi: 10.1016/j.isci.2024.110933. eCollection 2024 Oct 18.

DOI: 10.1016/j.isci.2024.110933
PMID: 39391725
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11465122/
Abstract

Pangenomics alignment offers a solution to reduce bias in biomedical research. Traditionally, short-read aligners like Bowtie and BWA indexed a single reference genome to find approximate alignments. These methods, limited by linear-memory requirements, can only index a few genomes. Emerging pangenome aligners, such as VG, Giraffe, and Moni, address this by indexing more genomes. VG and Giraffe use a variation graph, while Moni indexes sequences accounting for repetition using prefix-free parsing to build a dictionary and parse. The main challenge is the parse's size, which becomes significantly larger than the dictionary. To scale Moni, we propose removing the parse from the construction of the run-length encoded BWT (RLBWT), suffix array, and Longest Common Prefix (LCP) by applying prefix-free parsing recursively. This approach improves construction time and memory requirements, enabling efficient construction of RLBWT, suffix array, and LCP for large pangenomes, such as those from the Human Pangenome Reference Consortium.
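
The abstract describes prefix-free parsing as producing a dictionary and a parse, and proposes applying it recursively because the parse grows much larger than the dictionary. The parsing step alone can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the Moni implementation: the sentinel values, the simple polynomial hash (recomputed per window rather than rolled), the parameters `w` and `p`, and all function names are hypothetical, and the sketch does not build the RLBWT, suffix array, or LCP.

```python
# Illustrative sketch of prefix-free parsing (PFP) and its recursive use.
# All names and constants here are assumptions chosen for the example.

START, END = -2, -1  # hypothetical sentinels, smaller than every real symbol


def window_hash(window):
    """Deterministic polynomial hash of a window of integer symbols."""
    h = 0
    for sym in window:
        h = (h * 257 + sym + 2) % 1_000_000_007
    return h


def prefix_free_parse(symbols, w, p):
    """Split `symbols` into overlapping phrases.

    A phrase ends at any window of length w whose hash is 0 modulo p
    (a trigger) or at the final END padding; the next phrase starts at
    that same window, so consecutive phrases overlap by w symbols.
    Returns (sorted list of distinct phrases, list of phrase ranks).
    """
    s = [START] + list(symbols) + [END] * w
    phrases, start = [], 0
    last = len(s) - w
    for i in range(1, last + 1):
        if i == last or window_hash(s[i:i + w]) % p == 0:
            phrases.append(tuple(s[start:i + w]))
            start = i
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    return dictionary, [rank[phrase] for phrase in phrases]


def unparse(dictionary, parse, w):
    """Invert prefix_free_parse: rebuild the original symbol list."""
    out = list(dictionary[parse[0]])
    for r in parse[1:]:
        out.extend(dictionary[r][w:])  # skip the w-symbol overlap
    return out[1:-w]  # strip the START and END sentinels


def recursive_pfp(symbols, w, p, levels):
    """Apply PFP to the input, then again to each resulting parse."""
    dictionaries, current = [], list(symbols)
    for _ in range(levels):
        dictionary, current = prefix_free_parse(current, w, p)
        dictionaries.append(dictionary)
    return dictionaries, current


if __name__ == "__main__":
    text = "GATTACAGATTACAGATTAGA" * 3
    syms = [ord(c) for c in text]
    dicts, top_parse = recursive_pfp(syms, w=2, p=3, levels=2)
    print("dictionary sizes per level:", [len(d) for d in dicts])
    print("length of final parse:", len(top_parse))
```

In this sketch the second level treats the first-level parse (a list of phrase ranks) as a new text, which is the sense in which the parsing is recursive. How the resulting dictionaries are then used to construct the index is described in the paper and is not reproduced here.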

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7c2b/11465122/da8d540ec149/gr4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7c2b/11465122/f4ceece45f15/fx1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7c2b/11465122/10c662e0f8e4/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7c2b/11465122/0076d1a1fd11/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7c2b/11465122/f2fc1e55a68e/gr3.jpg

Similar articles

1
Building a pangenome alignment index via recursive prefix-free parsing.
iScience. 2024 Sep 12;27(10):110933. doi: 10.1016/j.isci.2024.110933. eCollection 2024 Oct 18.
2
Recursive Prefix-Free Parsing for Building Big BWTs.
bioRxiv. 2023 Jan 20:2023.01.18.524557. doi: 10.1101/2023.01.18.524557.
3
Recursive Prefix-Free Parsing for Building Big BWTs.
Proc Data Compress Conf. 2023 Mar;2023:62-70. Epub 2023 May 19.
4
MONI: A Pangenomic Index for Finding Maximal Exact Matches.
J Comput Biol. 2022 Feb;29(2):169-187. doi: 10.1089/cmb.2021.0290. Epub 2022 Jan 17.
5
Prefix-free parsing for building big BWTs.
Algorithms Mol Biol. 2019 May 24;14:13. doi: 10.1186/s13015-019-0148-5. eCollection 2019.
6
External memory BWT and LCP computation for sequence collections with applications.
Algorithms Mol Biol. 2019 Mar 8;14:6. doi: 10.1186/s13015-019-0140-0. eCollection 2019.
7
PFP Compressed Suffix Trees.
Proc Worksh Algorithm Eng Exp. 2021;2021:60-72. doi: 10.1137/1.9781611976472.5.
8
Efficient Construction of a Complete Index for Pan-Genomics Read Alignment.
J Comput Biol. 2020 Apr;27(4):500-513. doi: 10.1089/cmb.2019.0309. Epub 2020 Mar 16.
9
Pangenome graph construction from genome alignments with Minigraph-Cactus.
Nat Biotechnol. 2024 Apr;42(4):663-673. doi: 10.1038/s41587-023-01793-w. Epub 2023 May 10.
10
Fast and memory efficient approach for mapping NGS reads to a reference genome.
J Bioinform Comput Biol. 2019 Apr;17(2):1950008. doi: 10.1142/S0219720019500082.

Cited by

1
Mumemto: efficient maximal matching across pangenomes.
Genome Biol. 2025 Jun 17;26(1):169. doi: 10.1186/s13059-025-03644-0.
2
Partitioned Multi-MUM finding for scalable pangenomics.
bioRxiv. 2025 May 25:2025.05.20.654611. doi: 10.1101/2025.05.20.654611.
