Transformation of FASTA files into feature vectors for unsupervised compression of short reads databases.

Authors

Tang Tao, Li Jinyan

Affiliation

Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, NSW 2007, Australia.

Publication information

J Bioinform Comput Biol. 2021 Feb;19(1):2050048. doi: 10.1142/S0219720020500481. Epub 2021 Jan 20.

DOI:10.1142/S0219720020500481

PMID:33472569

Abstract

A biomedical study usually generates tens or hundreds of FASTA data sets of short reads. However, these data sets are currently compressed one by one, without considering the similarity between data sets, which could otherwise be exploited to improve de novo compression. We show that clustering these data sets into similar sub-groups for group-by-group compression can greatly improve compression performance. Our novel idea is to detect the lexicographically smallest k-mer (k-minimizer) for every read in each data set, and to use these k-mers as features and their frequencies in each data set as feature values, transforming each of these huge data sets into a characteristic feature vector. Unsupervised clustering algorithms are then applied to these vectors to find similar data sets and merge them. Because a large number of common k-mers with similar feature values in two data sets implies a high proportion of overlapping reads shared between them, merging similar data sets creates immense sequence redundancy that boosts compression performance. Experiments confirm that our clustering approach can gain up to 12% improvement over several state-of-the-art algorithms in compressing reads databases consisting of 17-100 data sets (48.57-197.97 GB).
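The transformation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the value of `k`, the use of relative frequencies, the cosine similarity measure, the `threshold`, and the greedy grouping step are all assumptions chosen for illustration, since the abstract does not name the clustering algorithm or its parameters. All function names and the toy reads are hypothetical.

```python
from collections import Counter
from math import sqrt


def read_minimizer(read: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of one read."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def minimizer_profile(reads, k):
    """Map each minimizer to its relative frequency in one data set."""
    counts = Counter(read_minimizer(r, k) for r in reads)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity of two sparse profiles (assumed measure)."""
    dot = sum(v * q.get(kmer, 0.0) for kmer, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def group_datasets(profiles, threshold):
    """Greedy threshold grouping, a stand-in for a real clustering algorithm."""
    groups = []
    for name, profile in profiles.items():
        for group in groups:
            # Join the first group that holds a sufficiently similar data set.
            if any(cosine_similarity(profile, profiles[m]) >= threshold
                   for m in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Toy data sets (hypothetical reads, far smaller than real FASTA files).
datasets = {
    "A": ["ACGTAC", "CGTACG", "TTACGA"],
    "B": ["ACGTAC", "CGTACG", "GTACGT"],
    "C": ["GGGTTT", "TTTGGG", "GGTTTG"],
}
profiles = {name: minimizer_profile(reads, k=3)
            for name, reads in datasets.items()}
groups = group_datasets(profiles, threshold=0.8)
print(groups)  # [['A', 'B'], ['C']]
```

In this toy case, every read in data sets A and B has the minimizer `ACG`, so the two profiles are identical and the data sets fall into one group; their reads would then be merged and compressed together. Data set C shares no minimizer with them and stays on its own.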

Similar articles

Transformation of FASTA files into feature vectors for unsupervised compression of short reads databases.

J Bioinform Comput Biol. 2021 Feb;19(1):2050048. doi: 10.1142/S0219720020500481. Epub 2021 Jan 20.

Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression.

Bioinformatics. 2019 Jun 1;35(12):2066-2074. doi: 10.1093/bioinformatics/bty936.

Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies.

Brief Funct Genomics. 2022 Sep 16;21(5):387-398. doi: 10.1093/bfgp/elac016.

Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression.

PLoS Comput Biol. 2021 Jul 19;17(7):e1009229. doi: 10.1371/journal.pcbi.1009229. eCollection 2021 Jul.

SCALCE: boosting sequence compression algorithms using locally consistent encoding.

Bioinformatics. 2012 Dec 1;28(23):3051-7. doi: 10.1093/bioinformatics/bts593. Epub 2012 Oct 9.

Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.

Bioinformatics. 2018 Feb 15;34(4):558-567. doi: 10.1093/bioinformatics/btx639.

An efficient classification algorithm for NGS data based on text similarity.

Genet Res (Camb). 2018 Sep 17;100:e8. doi: 10.1017/S0016672318000058.

HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads.

J Am Med Inform Assoc. 2014 Mar-Apr;21(2):363-73. doi: 10.1136/amiajnl-2013-002147. Epub 2013 Dec 24.

A randomized optimal k-mer indexing approach for efficient parallel genome sequence compression.

Gene. 2024 May 20;907:148235. doi: 10.1016/j.gene.2024.148235. Epub 2024 Feb 10.

Indexing k-mers in linear space for quality value compression.

J Bioinform Comput Biol. 2019 Oct;17(5):1940011. doi: 10.1142/S0219720019400110.

Cited by

OReO: optimizing read order for practical compression.

Bioinform Adv. 2025 Jun 3;5(1):vbaf128. doi: 10.1093/bioadv/vbaf128. eCollection 2025.

Transcriptome and Metabolome Integration Reveals the Impact of Fungal Elicitors on Triterpene Accumulation in .

J Fungi (Basel). 2023 May 24;9(6):604. doi: 10.3390/jof9060604.

SCA-NGS: Secure compression algorithm for next generation sequencing data using genetic operators and block sorting.

Sci Prog. 2021 Apr-Jun;104(2):368504211023276. doi: 10.1177/00368504211023276.
