School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing, 210023, China.
School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China.
BMC Bioinformatics. 2022 Jul 25;23(1):297. doi: 10.1186/s12859-022-04825-5.
Since the completion of the Human Genome Project at the turn of the century, sequencing data have proliferated at an unprecedented rate. As a consequence, it has become extremely difficult to store, back up, and migrate the enormous volume of genomic datasets, which continue to grow as the cost of sequencing falls. A much more efficient and scalable genome compression program is therefore urgently needed. In this manuscript, we propose a new Apache Spark based genome compression method called SparkGC, which runs efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark's in-memory computation capabilities to reduce compression time by keeping data in memory between the first-order and second-order compression stages. The evaluation shows that the compression ratio of SparkGC is at least 30% better than that of the best state-of-the-art methods. Its compression speed on a single worker node is at least 3.8 times that of the best state-of-the-art methods, and it scales well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC .
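The abstract does not describe how the first-order and second-order stages work internally, so the following is only a toy sketch of the general idea of a two-stage pipeline whose intermediate result stays in memory. It is not SparkGC's code. Everything in it is an assumption made for illustration: the function names (`first_order`, `second_order`, and their inverses), the k-mer size, and the token formats are all hypothetical. Here the first stage encodes each genome as matches against a reference sequence, and the second stage replaces tokens already seen in earlier genomes with short back-references. In a Spark job, the in-memory hand-off between stages would typically be achieved by caching or persisting the intermediate dataset rather than with a plain Python list.

```python
from typing import Dict, List, Tuple

Token = Tuple  # ("M", ref_pos, length), ("L", base) or ("R", table_id); toy formats


def first_order(reference: str, target: str, k: int = 4) -> List[Token]:
    """Toy stage 1: greedy matching of the target against a reference."""
    index: Dict[str, int] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], i)  # keep the first occurrence
    tokens: List[Token] = []
    i = 0
    while i < len(target):
        pos = index.get(target[i:i + k])
        if pos is None:
            tokens.append(("L", target[i]))  # literal base, no match found
            i += 1
            continue
        length = k
        while (i + length < len(target) and pos + length < len(reference)
               and target[i + length] == reference[pos + length]):
            length += 1  # extend the match as far as it goes
        tokens.append(("M", pos, length))
        i += length
    return tokens


def undo_first_order(reference: str, tokens: List[Token]) -> str:
    """Rebuild a target sequence from its stage-1 tokens."""
    return "".join(reference[t[1]:t[1] + t[2]] if t[0] == "M" else t[1]
                   for t in tokens)


def second_order(encoded: List[List[Token]]) -> List[List[Token]]:
    """Toy stage 2: replace tokens seen earlier with ("R", id) back-references."""
    table: Dict[Token, int] = {}
    rows: List[List[Token]] = []
    for tokens in encoded:
        row: List[Token] = []
        for t in tokens:
            if t in table:
                row.append(("R", table[t]))
            else:
                table[t] = len(table)
                row.append(t)
        rows.append(row)
    return rows


def undo_second_order(rows: List[List[Token]]) -> List[List[Token]]:
    """Rebuild the stage-1 token lists from the stage-2 output."""
    table: List[Token] = []
    out: List[List[Token]] = []
    for row in rows:
        tokens: List[Token] = []
        for t in row:
            if t[0] == "R":
                tokens.append(table[t[1]])
            else:
                table.append(t)
                tokens.append(t)
        out.append(tokens)
    return out


if __name__ == "__main__":
    ref = "ACGTTGCAAGGCTTAC"
    genomes = ["ACGTTGCATGGCTTAC", "ACGTTGCATGGCTTAC"]  # made-up sequences
    stage1 = [first_order(ref, g) for g in genomes]  # stays in memory
    stage2 = second_order(stage1)                    # consumes it directly
    print(stage2)
```

The point of the sketch is the hand-off: `stage1` is never written to disk before `second_order` reads it, which is the property the abstract attributes to Spark's in-memory computation.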