Reference-based genome compression using the longest matched substrings with parallelization consideration.

Author information

School of Information, Yunnan University, Kunming, China.

Yunnan Physical Science and Sports Professional College, Kunming, China.

Publication information

BMC Bioinformatics. 2023 Sep 30;24(1):369. doi: 10.1186/s12859-023-05500-z.

DOI: 10.1186/s12859-023-05500-z

PMID: 37777730

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10544193/

Abstract

BACKGROUND

For decades, researchers have worked to speed up genome sequencing and reduce its cost, and they have made great strides in both areas, making genome data easier to study and analyze. However, efficiently storing and transmitting the vast amount of genome data generated by high-throughput sequencing technologies has become a challenge for data compression researchers, and genome compression algorithms that represent this data efficiently have therefore attracted growing attention. Meanwhile, because current computing devices have multiple cores, making full use of them and improving the efficiency of parallel processing is also an important direction in the design of genome compression algorithms.

RESULTS

We propose a reference-based algorithm (LMSRGC) for compressing genome data in FASTA format. It uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMSs) between the target genome sequence (the sequence to be compressed) and the reference genome sequence. The algorithm exploits the properties of the SA and the LCP array to select all appropriate LMSs and then uses them to encode the target sequence. To speed up the algorithm, we use GPUs to parallelize the construction of the SA and multiple threads to parallelize the creation of the LCP array and the filtering of LMSs.
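
The general idea of finding longest matches with an SA and an LCP array can be illustrated with a small, serial Python sketch. It is not the LMSRGC implementation: the naive suffix sort, the `#` and `$` separators, the greedy left-to-right parse, the `min_len` threshold, and all function names are assumptions made for illustration, and the sketch does none of the GPU or multithreaded work described above. It assumes the sequences contain neither `#` nor `$`.

```python
def suffix_array(s):
    # Naive O(n^2 log n) construction; adequate for a toy example only.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # Kasai's algorithm: lcp[r] is the LCP of the suffixes at sa[r-1] and sa[r].
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def longest_matches(ref, tgt):
    # For every target position, return (length, ref_pos) of the longest
    # substring starting there that also occurs in the reference.
    s = ref + "#" + tgt + "$"
    m, n = len(ref), len(ref) + len(tgt) + 2
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    best = [(0, -1)] * len(tgt)

    def record(i, length, pos):
        if m < i < n - 1:  # suffix i starts inside the target
            j = i - m - 1
            best[j] = max(best[j], (length, pos))

    # Forward sweep: compare with the nearest reference suffix above.
    cur_len, cur_pos = 0, -1
    for r in range(n):
        cur_len = min(cur_len, lcp[r])
        if sa[r] < m:
            cur_len, cur_pos = n, sa[r]
        else:
            record(sa[r], cur_len, cur_pos)
    # Backward sweep: compare with the nearest reference suffix below.
    cur_len, cur_pos = 0, -1
    for r in range(n - 1, -1, -1):
        if sa[r] < m:
            cur_len, cur_pos = n, sa[r]
        else:
            record(sa[r], cur_len, cur_pos)
        cur_len = min(cur_len, lcp[r])
    return best


def compress(ref, tgt, min_len=4):
    # Greedy parse into ("M", ref_pos, length) matches and ("L", text) literals.
    best = longest_matches(ref, tgt)
    ops, lit, j = [], [], 0
    while j < len(tgt):
        length, pos = best[j]
        if length >= min_len:
            if lit:
                ops.append(("L", "".join(lit)))
                lit = []
            ops.append(("M", pos, length))
            j += length
        else:
            lit.append(tgt[j])
            j += 1
    if lit:
        ops.append(("L", "".join(lit)))
    return ops


def decompress(ref, ops):
    # Rebuild the target by copying reference slices and literals in order.
    return "".join(ref[op[1]:op[1] + op[2]] if op[0] == "M" else op[1]
                   for op in ops)


ref = "ACGTACGTTTGACCA"
tgt = "ACGTTTGAGGACCA"
ops = compress(ref, tgt)
assert decompress(ref, ops) == tgt  # the encoding is lossless
```

In a suffix array built over the concatenated reference and target, the reference suffix sharing the longest prefix with a given target suffix is always one of its nearest reference neighbours in sorted order, which is why two linear sweeps over the LCP array are enough. A practical compressor would also need a linear-time or parallel SA construction and a compact encoding of the match list.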

CONCLUSIONS

Experimental results demonstrate that our algorithm is competitive with current state-of-the-art algorithms in both compression ratio and compression time.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/18eb4702768b/12859_2023_5500_Fig1_HTML.jpg
