BWT construction and search at the terabase scale.

Author information

Li Heng

Affiliations

Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States.

Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.

Publication information

Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae717.

DOI:10.1093/bioinformatics/btae717
PMID:39607778
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11646566/
Abstract

MOTIVATION

The Burrows-Wheeler Transform (BWT) is a common component in full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource intensive and difficult to parallelize, and many methods for querying large full-text indices only report exact matches or their simple extensions. These limitations have hampered the biological applications of full-text indices.

RESULTS

We developed ropebwt3 for efficient BWT construction and query. Ropebwt3 indexed 320 assembled human genomes in 65 h and indexed 7.3 terabases of commonly studied bacterial assemblies in 26 days. This was achieved using up to 170 gigabytes of memory at the peak without working disk space. Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties, and can retrieve similar local haplotypes matching a query sequence. It demonstrates the feasibility of full-text indexing at the terabase scale.
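
For readers unfamiliar with the underlying data structure, the following is a minimal toy sketch of what "BWT construction and query" means. It is not ropebwt3's algorithm: it builds the BWT naively by sorting all rotations (quadratic memory, unusable at scale) and counts exact pattern occurrences with a plain backward search. The function names `bwt` and `count_occurrences` are hypothetical, and the sentinel `$` is assumed to sort before every other symbol.

```python
def bwt(text: str) -> str:
    """Naive BWT: sort all rotations of text + '$' and take the last column.

    Assumes '$' does not occur in text and sorts before all its symbols.
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    # C[c] = number of symbols in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)

    def occ(c: str, i: int) -> int:
        # Number of occurrences of c in the first i symbols of the BWT.
        # A real index would answer this in constant or logarithmic time.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # half-open interval of matching suffixes
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("GATTACA")
    print(index, count_occurrences(index, "TA"))
```

The backward search only ever reads the transformed string, which is why a compressed BWT can serve as a full-text index. Practical tools replace the linear-time `occ` scan with rank data structures over a run-length encoded BWT.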

AVAILABILITY AND IMPLEMENTATION

https://github.com/lh3/ropebwt3.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0da9/11646566/ac2c44aed375/btae717f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0da9/11646566/ec3a9753dbf8/btae717f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0da9/11646566/4378949d4e97/btae717f2.jpg
