Electrical and Information Engineering College, JiLin Agricultural Science and Technology University, Jilin, China.
College of Electrical and Information, Northeast Agricultural University, Harbin, China.
PLoS One. 2018 Nov 5;13(11):e0206521. doi: 10.1371/journal.pone.0206521. eCollection 2018.
The massive quantities of genetic data generated by high-throughput sequencing pose challenges for data storage, transmission and analysis. Data compression can address these problems effectively by reducing the size of stored data and speeding up data transmission. Several options are available for compressing and storing genetic data. However, most of them either do not provide sufficient compression rates or require considerable time for decompression and loading.
Here, we propose TRCMGene, a lossless genetic data compression method that uses a referential compression scheme. TRCMGene introduces a novel two-step compression method that builds an index structure using K-means and k-nearest neighbours. Evaluation on several real datasets showed that the compression factor of TRCMGene ranges from 9 to 21. TRCMGene offers a good balance between compression factor and reading time: on average, reading compressed data takes 60% of the time needed to read uncompressed data. Thus, TRCMGene saves not only disk space but also file access time, and it speeds up data loading. Together, these effects improve genetic data storage and transmission in the current hardware environment and render system upgrades unnecessary. TRCMGene, its user manual and demos can be accessed freely at https://github.com/tangyou79/TRCM. The data mentioned in this manuscript can be downloaded from https://github.com/tangyou79/TRCM/wiki.
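To illustrate the general idea behind a lossless referential compression scheme, the following minimal sketch stores a target sequence only as its differences from a reference sequence and then restores it exactly. It is not TRCMGene's implementation: the function names `encode` and `decode`, the equal-length assumption, and the toy sequences are all hypothetical, and the K-means and k-nearest-neighbour index step is not shown.

```python
from typing import List, Tuple

# A diff is a list of (position, symbol) pairs where the target
# differs from the reference. Hypothetical representation.
Diff = List[Tuple[int, str]]


def encode(reference: str, target: str) -> Diff:
    """Record only the positions where target differs from reference."""
    if len(reference) != len(target):
        raise ValueError("this sketch assumes equal-length sequences")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]


def decode(reference: str, diff: Diff) -> str:
    """Rebuild the target exactly by applying the diff to the reference."""
    seq = list(reference)
    for pos, symbol in diff:
        seq[pos] = symbol
    return "".join(seq)


# Toy data, invented for illustration only.
reference = "ACGTACGTACGT"
target = "ACGTACCTACGA"

diff = encode(reference, target)
restored = decode(reference, diff)
assert restored == target  # lossless round trip
```

The more closely the reference resembles the target, the shorter the stored difference list becomes, which is why choosing a good reference matters in schemes of this kind.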