deGSM: Memory Scalable Construction Of Large Scale de Bruijn Graph.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2157-2166. doi: 10.1109/TCBB.2019.2913932. Epub 2021 Dec 8.

DOI:10.1109/TCBB.2019.2913932
PMID:31056509
Abstract

The de Bruijn graph, a fundamental data structure for representing and organizing genome sequences, plays an important role in many sequence analysis tasks. With the rapid growth of high-throughput sequencing (HTS) data and the ever-increasing number of assembled genomes, there is high demand for constructing very large de Bruijn graphs for sequences up to the tera-base-pair level. Current approaches may have unaffordable memory footprints when handling such large de Bruijn graphs. We propose a lightweight parallel de Bruijn graph construction approach: de Bruijn Graph Constructor in Scalable Memory (deGSM). The main idea of deGSM is to efficiently construct the Burrows-Wheeler Transform (BWT) of the unipaths of the de Bruijn graph in constant RAM space and then transform the BWT back into the original unitigs. The experimental results demonstrate that, on a commonly available machine, deGSM can handle very large genome sequence collections, e.g., the contigs (305 Gbp) and scaffolds (1.1 Tbp) recorded in the GenBank database and the Picea abies HTS dataset (9.7 Tbp). Moreover, deGSM has faster or comparable construction speed compared with state-of-the-art approaches. With its high scalability and efficiency, deGSM has enormous potential in many large-scale genomics studies. deGSM is publicly available at: https://github.com/hitbc/deGSM.
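
To make the terms in the abstract concrete, the following is a minimal in-memory sketch of a node-centric de Bruijn graph and its unipaths (maximal non-branching paths). It is not the deGSM algorithm, which builds a BWT in constant RAM. It is a toy illustration that assumes a single strand (no reverse complements) and the DNA alphabet ACGT, and the function names `build_nodes`, `successors`, `predecessors`, and `unitigs` are hypothetical.

```python
def build_nodes(seqs, k):
    """Collect the distinct k-mers (graph nodes) of all input sequences."""
    nodes = set()
    for s in seqs:
        for i in range(len(s) - k + 1):
            nodes.add(s[i:i + k])
    return nodes


def successors(node, nodes):
    """k-mers present in the graph that overlap `node` by its last k-1 bases."""
    return [node[1:] + c for c in "ACGT" if node[1:] + c in nodes]


def predecessors(node, nodes):
    """k-mers present in the graph that overlap `node` by its first k-1 bases."""
    return [c + node[:-1] for c in "ACGT" if c + node[:-1] in nodes]


def unitigs(nodes):
    """Return the maximal non-branching paths, spelled out as strings."""
    seen, out = set(), []

    def is_start(n):
        # A node starts a path unless it has exactly one predecessor
        # and that predecessor has exactly one successor.
        preds = predecessors(n, nodes)
        return len(preds) != 1 or len(successors(preds[0], nodes)) != 1

    def extend(n):
        # Walk forward while the path is unambiguous in both directions.
        path, cur = n, n
        seen.add(n)
        while True:
            succ = successors(cur, nodes)
            if len(succ) != 1:
                break
            nxt = succ[0]
            if nxt in seen or len(predecessors(nxt, nodes)) != 1:
                break
            path += nxt[-1]
            seen.add(nxt)
            cur = nxt
        return path

    # First pass: paths that begin at a branching or source node.
    for n in sorted(nodes):
        if n not in seen and is_start(n):
            out.append(extend(n))
    # Second pass: any node left over lies on an isolated cycle.
    for n in sorted(nodes):
        if n not in seen:
            out.append(extend(n))
    return sorted(out)


if __name__ == "__main__":
    nodes = build_nodes(["ATGGCGTGCA"], 3)
    print(unitigs(nodes))
```

Each k-mer ends up in exactly one unipath, so the set of unipaths is a lossless, more compact encoding of the graph. That property is what allows a tool to store or index the unipaths instead of the raw k-mer set.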

Similar articles

1
deGSM: Memory Scalable Construction Of Large Scale de Bruijn Graph.
IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2157-2166. doi: 10.1109/TCBB.2019.2913932. Epub 2021 Dec 8.
2
deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding.
Bioinformatics. 2016 Jun 15;32(12):i174-i182. doi: 10.1093/bioinformatics/btw266.
3
Simplitigs as an efficient and scalable representation of de Bruijn graphs.
Genome Biol. 2021 Apr 6;22(1):96. doi: 10.1186/s13059-021-02297-z.
4
Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections.
Bioinformatics. 2021 Jul 12;37(Suppl_1):i177-i186. doi: 10.1093/bioinformatics/btab309.
5
A space and time-efficient index for the compacted colored de Bruijn graph.
Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.
6
Fast de Bruijn Graph Compaction in Distributed Memory Environments.
IEEE/ACM Trans Comput Biol Bioinform. 2020 Jan-Feb;17(1):136-148. doi: 10.1109/TCBB.2018.2858797. Epub 2018 Jul 31.
7
Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.
BMC Bioinformatics. 2010 Nov 15;11:560. doi: 10.1186/1471-2105-11-560.
8
Building large updatable colored de Bruijn graphs via merging.
Bioinformatics. 2019 Jul 15;35(14):i51-i60. doi: 10.1093/bioinformatics/btz350.
9
Integrating long-range connectivity information into de Bruijn graphs.
Bioinformatics. 2018 Aug 1;34(15):2556-2565. doi: 10.1093/bioinformatics/bty157.
10
TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes.
Bioinformatics. 2017 Dec 15;33(24):4024-4032. doi: 10.1093/bioinformatics/btw609.

Cited by

1
Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets.
Bioinformatics. 2024 Jun 28;40(Suppl 1):i48-i57. doi: 10.1093/bioinformatics/btae217.
2
Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT.
Genome Res. 2023 Jul;33(7):1198-1207. doi: 10.1101/gr.277615.122. Epub 2023 May 30.
3
Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2.
Genome Biol. 2022 Sep 8;23(1):190. doi: 10.1186/s13059-022-02743-6.
4
SPRISS: approximating frequent k-mers by sampling reads, and applications.
Bioinformatics. 2022 Jun 27;38(13):3343-3350. doi: 10.1093/bioinformatics/btac180.
5
Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections.
Bioinformatics. 2021 Jul 12;37(Suppl_1):i177-i186. doi: 10.1093/bioinformatics/btab309.
6
Simplitigs as an efficient and scalable representation of de Bruijn graphs.
Genome Biol. 2021 Apr 6;22(1):96. doi: 10.1186/s13059-021-02297-z.
7
Representation of k-Mer Sets Using Spectrum-Preserving String Sets.
J Comput Biol. 2021 Apr;28(4):381-394. doi: 10.1089/cmb.2020.0431. Epub 2020 Dec 7.
8
Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs.
Genome Biol. 2020 Sep 17;21(1):249. doi: 10.1186/s13059-020-02135-8.
9
Super-Pangenome by Integrating the Wild Side of a Species for Accelerated Crop Improvement.
Trends Plant Sci. 2020 Feb;25(2):148-158. doi: 10.1016/j.tplants.2019.10.012. Epub 2019 Nov 29.
10
Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models.
Sci Rep. 2019 Nov 6;9(1):16157. doi: 10.1038/s41598-019-52196-4.