Compression of Large genomic datasets using COMRAD on Parallel Computing Platform.

Authors

Biji Christopher Leela, Madhu Manu K, Vishnu Vineetha, K Satheesh Kumar, Nair Achuthsankar S

Affiliations

Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram.

School of Computer Science, Mahathma Gandhi University, Kottayam.

Publication information

Bioinformation. 2015 May 28;11(5):267-71. doi: 10.6026/97320630011267. eCollection 2015.

DOI:10.6026/97320630011267

PMID:26124572

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4464544/

Abstract

Storing big data is a challenge in the post-genome era, so high-performance computing solutions are needed for managing large genomic data. This report therefore describes a parallel-computing approach that uses a message-passing library to distribute the different compression stages across a cluster. Genomic compression reduces the on-disk "footprint" of large volumes of sequence data, which in turn supports more efficient archiving on the available computational infrastructure. The approach was applied to 21 eukaryotic genomes selected by stratified sampling. On average, the method achieves a 6-fold reduction in disk space with a compression time three times better than that of COMRAD.

AVAILABILITY

The source code is written in C using message-passing libraries and is available at https://sourceforge.net/projects/comradmpi/files/COMRADMPI/.
