SparkGC: Spark based genome compression for large collections of genomes.

Affiliations

School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing, 210023, China.

School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China.

Publication information

BMC Bioinformatics. 2022 Jul 25;23(1):297. doi: 10.1186/s12859-022-04825-5.

DOI:10.1186/s12859-022-04825-5

PMID:35879669

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9310413/

Abstract

Since the completion of the Human Genome Project at the turn of the century, sequencing data have proliferated at an unprecedented rate. As a consequence, storing, backing up, and migrating enormous volumes of genomic data has become extremely difficult, and the datasets continue to expand as the cost of sequencing decreases. A much more efficient and scalable genome compression program is therefore urgently required. In this manuscript, we propose a new Apache Spark based genome compression method called SparkGC, which runs efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark's in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is at least 30% better than that of the best state-of-the-art methods. Its compression speed is at least 3.8 times that of the best state-of-the-art methods on only one worker node, and it scales well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC.
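The two-stage idea in the abstract can be illustrated with a minimal sketch in plain Python (no Spark involved). This is not the authors' implementation: the abstract does not say how either stage works, so the greedy k-mer matching used for the first order, the shared token table used for the second order, and all names (`first_order`, `second_order`, `restore`, `k`) are assumptions made purely for illustration. The one point taken from the abstract is that the first-order output stays in memory and is passed directly to the second stage instead of being written out in between.

```python
from typing import Dict, List, Tuple, Union

# A token is either a literal string of bases or a (reference_position, length) match.
Token = Union[str, Tuple[int, int]]


def first_order(reference: str, target: str, k: int = 4) -> List[Token]:
    """Hypothetical first-order stage: greedy referential encoding against `reference`."""
    index: Dict[str, int] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], i)  # remember the first occurrence of each k-mer

    tokens: List[Token] = []
    i = 0
    while i < len(target):
        pos = index.get(target[i:i + k])
        if pos is None:
            # No k-mer match here: emit one literal base, merging with a preceding literal.
            if tokens and isinstance(tokens[-1], str):
                tokens[-1] += target[i]
            else:
                tokens.append(target[i])
            i += 1
            continue
        # Extend the k-mer match as far as both sequences agree.
        length = k
        while (i + length < len(target) and pos + length < len(reference)
               and target[i + length] == reference[pos + length]):
            length += 1
        tokens.append((pos, length))
        i += length
    return tokens


def second_order(encoded: List[List[Token]]) -> Tuple[List[Token], List[List[int]]]:
    """Hypothetical second-order stage: store each distinct token once across all genomes."""
    table: List[Token] = []
    ids: Dict[Token, int] = {}
    rows: List[List[int]] = []
    for tokens in encoded:
        row = []
        for tok in tokens:
            if tok not in ids:
                ids[tok] = len(table)
                table.append(tok)
            row.append(ids[tok])
        rows.append(row)
    return table, rows


def restore(reference: str, table: List[Token], row: List[int]) -> str:
    """Invert both stages for one genome."""
    parts = []
    for i in row:
        tok = table[i]
        parts.append(tok if isinstance(tok, str) else reference[tok[0]:tok[0] + tok[1]])
    return "".join(parts)


# Toy data, invented for this example.
reference = "ACGTACGTTTGACCAGT"
genomes = ["ACGTACGATTGACCAGT", "ACGTACGATTGACCAGT", "TTGACCAGTACGT"]

stage1 = [first_order(reference, g) for g in genomes]  # kept in memory, not written to disk
table, rows = second_order(stage1)                      # consumes stage1 directly
assert all(restore(reference, table, r) == g for r, g in zip(rows, genomes))
```

In a Spark setting the in-memory list `stage1` would correspond to a dataset kept resident between the two stages; how SparkGC actually does this is described in the paper and source code, not here.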

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6e07/9310413/1c8044694069/12859_2022_4825_Fig1_HTML.jpg

Similar articles

High-speed and high-ratio referential genome compression.

Bioinformatics. 2017 Nov 1;33(21):3364-3372. doi: 10.1093/bioinformatics/btx412.

SparkGA2: Production-quality memory-efficient Apache Spark based genome analysis framework.

PLoS One. 2019 Dec 5;14(12):e0224784. doi: 10.1371/journal.pone.0224784. eCollection 2019.

A new efficient referential genome compression technique for FastQ files.

Funct Integr Genomics. 2023 Nov 11;23(4):333. doi: 10.1007/s10142-023-01259-x.

PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping.

Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae323.

Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment.

PLoS One. 2021 Aug 3;16(8):e0255260. doi: 10.1371/journal.pone.0255260. eCollection 2021.

VC@Scale: Scalable and high-performance variant calling on cluster environments.

Gigascience. 2021 Sep 7;10(9). doi: 10.1093/gigascience/giab057.

Sketch distance-based clustering of chromosomes for large genome database compression.

BMC Genomics. 2019 Dec 30;20(Suppl 10):978. doi: 10.1186/s12864-019-6310-0.

ADS-HCSpark: A scalable HaplotypeCaller leveraging adaptive data segmentation to accelerate variant calling on Spark.

BMC Bioinformatics. 2019 Feb 14;20(1):76. doi: 10.1186/s12859-019-2665-0.

GVC: efficient random access compression for gene sequence variations.

BMC Bioinformatics. 2023 Mar 28;24(1):121. doi: 10.1186/s12859-023-05240-0.

Cited by

Enhancing genomic mutation data storage optimization based on the compression of asymmetry of sparsity.

Front Genet. 2023 Jun 1;14:1213907. doi: 10.3389/fgene.2023.1213907. eCollection 2023.

Framing Apache Spark in life sciences.

Heliyon. 2023 Feb 9;9(2):e13368. doi: 10.1016/j.heliyon.2023.e13368. eCollection 2023 Feb.
