Parallelized short read assembly of large genomes using de Bruijn graphs.

Affiliation

School of Computer Engineering, Nanyang Technological University, Singapore.

Publication information

BMC Bioinformatics. 2011 Aug 25;12:354. doi: 10.1186/1471-2105-12-354.

DOI:10.1186/1471-2105-12-354

PMID:21867511

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3167803/

Abstract

BACKGROUND

Next-generation sequencing technologies have driven an explosive increase in DNA sequencing throughput and have promoted the recent development of de novo short read assemblers. However, existing assemblers require long execution times and large amounts of compute resources to assemble large genomes from large quantities of short reads.

RESULTS

We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows that PASHA is able to produce more contiguous high-quality assemblies in shorter time compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. PASHA's scalability for large genome datasets is demonstrated with human genome assembly. Compared to ABySS, PASHA achieves competitive assembly quality with faster execution speed on the same compute resources, yielding an NG50 contig size of 503 with the longest correct contig size of 18,252, and an NG50 scaffold size of 2,294. Moreover, the human assembly is completed in about 21 hours with only modest compute resources.
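The NG50 figures quoted above follow the usual definition: sort the contigs (or scaffolds) from longest to shortest and report the length of the one at which the running total first reaches half of the genome size. The sketch below illustrates that definition only; the function name `ng50` and the toy lengths are assumptions for illustration, not part of PASHA or its evaluation scripts.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of a set of contigs for a given genome size.

    NG50 is the length L such that contigs of length >= L together
    cover at least half of the genome. Returns 0 when the contigs
    cover less than half of the genome.
    """
    half = genome_size / 2
    covered = 0
    # Walk the contigs from longest to shortest, accumulating coverage.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0


# Toy example (hypothetical lengths): half of 300 is 150,
# which is first reached by the second-longest contig (80 + 70).
print(ng50([80, 70, 50, 40, 30, 20], genome_size=300))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size, so assemblies of different total length can be compared on the same footing.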

CONCLUSIONS

The development of parallel assemblers for large genomes has attracted significant research effort owing to the explosive growth in the size of high-throughput short read datasets. By employing hybrid parallelism consisting of multi-threading on multi-core CPUs and message passing on compute clusters, PASHA is able to assemble the human genome with high quality and in reasonable time using modest compute resources.
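To make the distributed-memory idea concrete, the following is a minimal single-process sketch of how a de Bruijn graph can be split across workers: each k-mer node is assigned to an owner by hashing, so every worker stores only its own share of the graph. This is a generic illustration under stated assumptions, not PASHA's implementation. The names `owner` and `build_partitioned_graph`, the CRC32 hash, and the use of in-memory dictionaries in place of message passing are all hypothetical choices, and reverse complements and sequencing errors are ignored.

```python
import zlib
from collections import defaultdict


def kmers(read, k):
    """Yield every substring of length k in the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def owner(kmer, num_workers):
    """Map a k-mer to a worker index with a deterministic hash (assumed scheme)."""
    return zlib.crc32(kmer.encode("ascii")) % num_workers


def build_partitioned_graph(reads, k, num_workers):
    """Build a de Bruijn graph whose nodes are spread over num_workers partitions.

    Each (k+1)-mer in a read contributes one edge from its k-mer prefix
    to its k-mer suffix; the edge is stored on the worker that owns the prefix.
    """
    partitions = [defaultdict(set) for _ in range(num_workers)]
    for read in reads:
        for edge in kmers(read, k + 1):
            src, dst = edge[:-1], edge[1:]
            partitions[owner(src, num_workers)][src].add(dst)
    return partitions


# Toy usage: one short read, 3-mers as nodes, two simulated workers.
parts = build_partitioned_graph(["ACGTAC"], k=3, num_workers=2)
for rank, part in enumerate(parts):
    print(rank, {node: sorted(succ) for node, succ in part.items()})
```

In a real cluster setting, the lookup `partitions[owner(...)]` would correspond to sending the k-mer to the owning process, while several threads within each process could work on that process's share of the graph.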

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a899/3167803/fb469888fa2c/1471-2105-12-354-1.jpg
