Ding Youde, Liao Yuan, He Ji, Ma Jianfeng, Wei Xu, Liu Xuemei, Zhang Guiying, Wang Jing
The Sixth Affiliated Hospital of Guangzhou Medical University, Qingyuan People's Hospital, Qingyuan, China.
School of Biomedical Engineering, Guangzhou Medical University, Guangzhou, China.
Front Genet. 2023 Jun 1;14:1213907. doi: 10.3389/fgene.2023.1213907. eCollection 2023.
With the rapid development of high-throughput sequencing technology and the explosive growth of genomic data, storing, transmitting and processing massive amounts of data has become a new challenge. Achieving fast lossless compression and decompression that exploits the characteristics of the data, and thereby speeds up transmission and processing, requires research on suitable compression algorithms. In this paper, we propose a compression algorithm for sparse asymmetric gene mutations (CA_SAGM) based on the characteristics of sparse genomic mutation data. The data were first sorted on a row-first basis so that neighboring non-zero elements were as close to each other as possible. The data were then renumbered using the reverse Cuthill-McKee sorting technique. Finally, the data were compressed into compressed sparse row (CSR) format and stored. We analyzed and compared the results of the CA_SAGM, coordinate format (COO) and compressed sparse column format (CSC) algorithms on sparse asymmetric genomic data. Nine types of single-nucleotide variation (SNV) data and six types of copy number variation (CNV) data from the TCGA database were used as the subjects of this study. Compression and decompression time, compression and decompression rate, compression memory and compression ratio were used as evaluation metrics. The correlation between each metric and the basic characteristics of the original data was further investigated. The experimental results showed that COO had the best compression performance, with the shortest compression time, the fastest compression rate and the largest compression ratio. CSC had the worst compression performance, and CA_SAGM fell between the two. For decompression, CA_SAGM performed best, with the shortest decompression time and the fastest decompression rate, while COO performed worst. With increasing sparsity, the COO, CSC and CA_SAGM algorithms all exhibited longer compression and decompression times, lower compression and decompression rates, larger compression memory and lower compression ratios. When the sparsity was large, the three algorithms showed no difference in compression memory or compression ratio, but the remaining metrics still differed. CA_SAGM is an efficient compression algorithm for sparse genomic mutation data that balances compression and decompression performance.
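The three steps described above (row-first sorting, reverse Cuthill-McKee renumbering, CSR storage) can be sketched with SciPy as follows. This is only an illustrative sketch, not the authors' implementation: the function names `compress` and `decompress`, the small square toy matrix, and the choice to apply one permutation to both rows and columns are assumptions, and the abstract does not say how non-square mutation matrices are handled.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee


def compress(dense):
    """Hypothetical helper: row-first sort, RCM renumbering, CSR storage."""
    # Step 1: collect the non-zero entries and sort them row-first
    # (by row, then by column). np.nonzero already returns row-major
    # order, so the explicit sort is kept only to make the step visible.
    rows, cols = np.nonzero(dense)
    order = np.lexsort((cols, rows))  # the last key (rows) is the primary key
    rows, cols = rows[order], cols[order]
    vals = dense[rows, cols]
    csr = coo_matrix((vals, (rows, cols)), shape=dense.shape).tocsr()

    # Step 2: reverse Cuthill-McKee ordering. symmetric_mode=False lets
    # SciPy handle an asymmetric sparsity pattern (assumes a square matrix).
    perm = reverse_cuthill_mckee(csr, symmetric_mode=False)

    # Step 3: renumber rows and columns, keeping the result in CSR format.
    reordered = csr[perm, :][:, perm]
    return reordered, perm


def decompress(reordered, perm):
    """Hypothetical helper: undo the renumbering and return a dense array."""
    inverse = np.argsort(perm)  # inverse of the RCM permutation
    return reordered[inverse, :][:, inverse].toarray()


# Toy asymmetric matrix standing in for a mutation matrix (made-up values).
dense = np.array([
    [0, 0, 0, 2, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 3],
    [4, 0, 0, 0, 0],
])

reordered, perm = compress(dense)
restored = decompress(reordered, perm)
```

The permutation must be stored alongside the CSR arrays (`data`, `indices`, `indptr`); without it the original numbering cannot be recovered and the compression is no longer lossless.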