• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

用于构建大型双向 de Bruijn 图的高效并行和外核算法。

Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.

机构信息

Department of Computer Science and Engineering, University of Connecticut, 371 Fairfield Way, U-2155, Storrs, CT 06269, USA.

出版信息

BMC Bioinformatics. 2010 Nov 15;11:560. doi: 10.1186/1471-2105-11-560.

DOI:10.1186/1471-2105-11-560
PMID:21078174
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC2996408/
Abstract

BACKGROUND

Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories - based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet).

RESULTS

In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Θ(nlog(n/B)Blog(M/B)) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster--both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem.

CONCLUSIONS

The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on Eulerian approach. Our algorithms for constructing Bi-directed de Bruijn graphs are efficient in parallel and out of core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.

摘要

背景

将一组重叠读取的基因组序列组装在一起是计算生物学中最基本的问题之一。解决组装问题的算法分为两类 - 基于它们使用的数据结构。第一类使用重叠/字符串图,第二类使用 de Bruijn 图。然而,随着短读测序技术的最新进展,基于 de Bruijn 图的算法在实践中似乎起着至关重要的作用。在基于短读的大型测序项目中,构建这些大规模 de Bruijn 图的高效算法是非常必要的。在早期的工作中,已经给出了一种用于此问题的 O(n/p)时间并行算法。这里 n 是输入的大小,p 是处理器的数量。该算法枚举可以与节点重叠的所有可能的双向边,并最终生成 Θ(nΣ)条消息(Σ 是字母表的大小)。

结果

在本文中,我们提出了一种具有与并行排序相同通信复杂度的 Θ(n/p)时间并行算法,并且对 Σ 不敏感。我们算法的通用性使得它非常容易扩展到核外模型,在这种情况下,它具有最优的 I/O 复杂度为 Θ(nlog(n/B)Blog(M/B))(M 是主内存大小,B 是磁盘块的大小)。我们在 SGI/Altix 计算机上展示了我们的并行算法的可扩展性。将我们的算法与以前的方法进行比较表明,我们的算法无论是在理论上还是在实践中都更快。我们通过将其与 VELVET 用于构建双向 de Bruijn 图的算法进行比较,展示了我们的顺序核外算法的可扩展性。我们的实验表明,我们的算法可以使用恒定数量的内存构建图形,这明显优于 VELVET。我们还提供了用于双向链紧缩问题的有效算法。

结论

双向 de Bruijn 图是基于 Eulerian 方法的任何序列组装程序的基本数据结构。我们用于构建双向 de Bruijn 图的算法在并行和核外环境中都非常高效。这些算法可用于构建大规模双向 de Bruijn 图。此外,我们的算法在并行设置中不使用任何全对全通信,并且比以前的算法表现更好。最后,我们的核外算法具有极高的内存效率,可以替代 VELVET 中现有的图构建算法。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/31b31144467f/1471-2105-11-560-7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/b7a03f5cb21d/1471-2105-11-560-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/d6bd6f3f8b8f/1471-2105-11-560-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/e592569add6e/1471-2105-11-560-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/e4e7970afac8/1471-2105-11-560-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/99694cf3fa67/1471-2105-11-560-5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/1851cb305efc/1471-2105-11-560-6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/31b31144467f/1471-2105-11-560-7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/b7a03f5cb21d/1471-2105-11-560-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/d6bd6f3f8b8f/1471-2105-11-560-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/e592569add6e/1471-2105-11-560-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/e4e7970afac8/1471-2105-11-560-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/99694cf3fa67/1471-2105-11-560-5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/1851cb305efc/1471-2105-11-560-6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6ffe/2996408/31b31144467f/1471-2105-11-560-7.jpg

相似文献

1
Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.用于构建大型双向 de Bruijn 图的高效并行和外核算法。
BMC Bioinformatics. 2010 Nov 15;11:560. doi: 10.1186/1471-2105-11-560.
2
AN EFFICIENT ALGORITHM FOR CHINESE POSTMAN WALK ON BI-DIRECTED DE BRUIJN GRAPHS.一种在双向德布鲁因图上的中国邮路问题的高效算法。
Discrete Math Algorithms Appl. 2010;1:184-196. doi: 10.1007/978-3-642-17458-2_16.
3
Fast de Bruijn Graph Compaction in Distributed Memory Environments.快速有向无环图压缩在分布式内存环境中。
IEEE/ACM Trans Comput Biol Bioinform. 2020 Jan-Feb;17(1):136-148. doi: 10.1109/TCBB.2018.2858797. Epub 2018 Jul 31.
4
On the Hardness of Sequence Alignment on De Bruijn Graphs.基于 De Bruijn 图的序列比对问题的难度研究
J Comput Biol. 2022 Dec;29(12):1377-1396. doi: 10.1089/cmb.2022.0411. Epub 2022 Nov 25.
5
Velvet: algorithms for de novo short read assembly using de Bruijn graphs.《天鹅绒:使用德布鲁因图进行从头短读长拼接的算法》
Genome Res. 2008 May;18(5):821-9. doi: 10.1101/gr.074492.107. Epub 2008 Mar 18.
6
FastEtch: A Fast Sketch-Based Assembler for Genomes.FastEtch:一种基于草图的快速基因组装配器。
IEEE/ACM Trans Comput Biol Bioinform. 2019 Jul-Aug;16(4):1091-1106. doi: 10.1109/TCBB.2017.2737999. Epub 2017 Sep 11.
7
Integrating long-range connectivity information into de Bruijn graphs.将长程连接信息整合到 de Bruijn 图中。
Bioinformatics. 2018 Aug 1;34(15):2556-2565. doi: 10.1093/bioinformatics/bty157.
8
Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches.用于纳米孔数据的从头组装算法基准测试揭示了重叠布局一致(OLC)方法的最佳性能。
BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):507. doi: 10.1186/s12864-016-2895-8.
9
HyDA-Vista: towards optimal guided selection of k-mer size for sequence assembly.HyDA-Vista:迈向序列组装中k-mer大小的最优引导选择
BMC Genomics. 2014;15 Suppl 10(Suppl 10):S9. doi: 10.1186/1471-2164-15-S10-S9. Epub 2014 Dec 12.
10
Read mapping on de Bruijn graphs.在德布鲁因图上进行读段映射。
BMC Bioinformatics. 2016 Jun 16;17(1):237. doi: 10.1186/s12859-016-1103-9.

引用本文的文献

1
Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections.乌贼算法:从大规模基因组集合中快速、并行且低内存消耗的 de Bruijn 图压缩。
Bioinformatics. 2021 Jul 12;37(Suppl_1):i177-i186. doi: 10.1093/bioinformatics/btab309.
2
Compacting de Bruijn graphs from sequencing data quickly and in low memory.从测序数据中快速且低内存地压缩德布鲁因图。
Bioinformatics. 2016 Jun 15;32(12):i201-i208. doi: 10.1093/bioinformatics/btw279.
3
Breaking the computational barriers of pairwise genome comparison.

本文引用的文献

1
ABySS: a parallel assembler for short read sequence data.ABySS:一种用于短读长序列数据的并行汇编器。
Genome Res. 2009 Jun;19(6):1117-23. doi: 10.1101/gr.089532.108. Epub 2009 Feb 27.
2
Velvet: algorithms for de novo short read assembly using de Bruijn graphs.《天鹅绒:使用德布鲁因图进行从头短读长拼接的算法》
Genome Res. 2008 May;18(5):821-9. doi: 10.1101/gr.074492.107. Epub 2008 Mar 18.
3
PCAP: a whole-genome assembly program.PCAP:一个全基因组组装程序。
突破成对基因组比较的计算障碍。
BMC Bioinformatics. 2015 Aug 11;16(1):250. doi: 10.1186/s12859-015-0679-9.
4
Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold.利用 MaryGold 探索(比较)宏基因组学中的变异感知重叠群图。
Bioinformatics. 2013 Nov 15;29(22):2826-34. doi: 10.1093/bioinformatics/btt502. Epub 2013 Sep 20.
Genome Res. 2003 Sep;13(9):2164-70. doi: 10.1101/gr.1390403.
4
ARACHNE: a whole-genome shotgun assembler.ARACHNE:一种全基因组鸟枪法测序序列拼接程序。
Genome Res. 2002 Jan;12(1):177-89. doi: 10.1101/gr.208902.
5
An Eulerian path approach to DNA fragment assembly.一种用于DNA片段组装的欧拉路径方法。
Proc Natl Acad Sci U S A. 2001 Aug 14;98(17):9748-53. doi: 10.1073/pnas.171285098.
6
A whole-genome assembly of Drosophila.果蝇的全基因组组装
Science. 2000 Mar 24;287(5461):2196-204. doi: 10.1126/science.287.5461.2196.