Tang Tao, Li Jinyan
Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, NSW 2007, Australia.
J Bioinform Comput Biol. 2021 Feb;19(1):2050048. doi: 10.1142/S0219720020500481. Epub 2021 Jan 20.
FASTA data sets of short reads are usually generated in tens or hundreds for a biomedical study. However, these data sets are currently compressed one by one, without considering the similarity between data sets, which could otherwise be exploited to enhance the performance of de novo compression. We show that clustering these data sets into similar sub-groups for group-by-group compression can greatly improve compression performance. Our novel idea is to detect the lexicographically smallest k-mer (k-minimizer) for every read in each data set, and to use these k-mers as features and their frequencies in every data set as feature values, transforming each of these huge data sets into a characteristic feature vector. Unsupervised clustering algorithms are then applied to these vectors to find similar data sets and merge them. Because a large number of common k-mers with similar feature values between two data sets implies a high proportion of overlapping reads shared between them, merging similar data sets creates immense sequence redundancy that boosts compression performance. Experiments confirm that our clustering approach can gain up to 12% improvement over several state-of-the-art algorithms in compressing reads databases consisting of 17-100 data sets (48.57-197.97 GB).
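The feature-extraction and grouping steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, the value of k, or the similarity measure, so the cosine similarity, the threshold-based single-linkage grouping, and all function names below (`minimizer`, `feature_vector`, `cosine`, `cluster`) are assumptions chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def minimizer(read: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of a read."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def feature_vector(reads, k):
    """Map one data set to {minimizer: relative frequency} (sparse vector)."""
    counts = Counter(minimizer(r, k) for r in reads if len(r) >= k)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed similarity measure)."""
    dot = sum(w * v.get(m, 0.0) for m, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster(vectors, threshold):
    """Group data sets whose pairwise similarity reaches the threshold.

    Uses union-find, so groups are connected components (single linkage).
    This stands in for whichever unsupervised algorithm the paper uses.
    """
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data sets (made up for illustration, not from the paper).
datasets = [
    ["ACGTAC", "ACGTTT", "CCGTAA"],
    ["ACGTAC", "ACGTTG", "CCGTAA"],
    ["TTGGTT", "GGTTGG", "TGGTTG"],
]
vectors = [feature_vector(ds, k=3) for ds in datasets]
groups = cluster(vectors, threshold=0.8)
```

In this toy example the first two data sets share the same minimizer frequencies and would be merged before compression, while the third stays on its own. Each resulting group would then be passed as a whole to a reads compressor.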