IEEE/ACM Trans Comput Biol Bioinform. 2019 Jan-Feb;16(1):208-221. doi: 10.1109/TCBB.2017.2762302. Epub 2017 Oct 12.
Due to the advancement of DNA sequencing techniques, the number of sequenced individual genomes has experienced an exponential growth. Thus, effective compression of this kind of sequences is highly desired. In this work, we present a novel compression algorithm called Reference-based Compression algorithm using the concept of Clustering (RCC). The rationale behind RCC is based on the observation about the existence of substructures within the population sequences. To utilize these substructures, k-means clustering is employed to partition sequences into clusters for better compression. A reference sequence is then constructed for each cluster so that sequences in that cluster can be compressed by referring to this reference sequence. The reference sequence of each cluster is also compressed with reference to a sequence which is derived from all the reference sequences. Experiments show that RCC can further reduce the compressed size by up to 91.0 percent when compared with state-of-the-art compression approaches. There is a compromise between compressed size and processing time. The current implementation in Matlab has time complexity in a factor of thousands higher than the existing algorithms implemented in C/C++. Further investigation is required to improve processing time in future.
由于 DNA 测序技术的进步,测序个体基因组的数量呈指数级增长。因此,非常需要有效地压缩这种序列。在这项工作中,我们提出了一种新的压缩算法,称为基于参考的聚类压缩算法(RCC)。RCC 的基本原理是基于对群体序列中存在子结构的观察。为了利用这些子结构,使用 k-均值聚类将序列划分为簇,以实现更好的压缩。然后为每个簇构建一个参考序列,以便可以通过引用该参考序列来压缩该簇中的序列。还使用源自所有参考序列的序列来压缩每个簇的参考序列。实验表明,与最先进的压缩方法相比,RCC 可以将压缩后的大小进一步减少 91.0%。在压缩大小和处理时间之间存在折衷。目前在 Matlab 中的实现的时间复杂度比用 C/C++ 实现的现有算法高几千倍。需要进一步研究以提高未来的处理时间。