Suppr超能文献

基于参考的最长匹配子串基因组压缩及其并行化考虑。

Reference-based genome compression using the longest matched substrings with parallelization consideration.

机构信息

School of Information, Yunnan University, KunMing, China.

Yunnan Physical Science and Sports Professional College, KunMing, China.

出版信息

BMC Bioinformatics. 2023 Sep 30;24(1):369. doi: 10.1186/s12859-023-05500-z.

Abstract

BACKGROUND

A large number of researchers have devoted to accelerating the speed of genome sequencing and reducing the cost of genome sequencing for decades, and they have made great strides in both areas, making it easier for researchers to study and analyze genome data. However, how to efficiently store and transmit the vast amount of genome data generated by high-throughput sequencing technologies has become a challenge for data compression researchers. Therefore, the research of genome data compression algorithms to facilitate the efficient representation of genome data has gradually attracted the attention of these researchers. Meanwhile, considering that the current computing devices have multiple cores, how to make full use of the advantages of the computing devices and improve the efficiency of parallel processing is also an important direction for designing genome compression algorithms.

RESULTS

We proposed an algorithm (LMSRGC) based on reference genome sequences, which uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMS) for the compression of genome data in FASTA format. The proposed algorithm utilizes the characteristics of SA and the LCP array to select all appropriate LMSs between the genome sequence to be compressed and the reference genome sequence and then utilizes LMSs to compress the target genome sequence. To speed up the operation of the algorithm, we use GPUs to parallelize the construction of SA, while using multiple threads to parallelize the creation of the LCP array and the filtering of LMSs.

CONCLUSIONS

Experiment results demonstrate that our algorithm is competitive with the current state-of-the-art algorithms in compression ratio and compression time.

摘要

背景

数十年来,大量研究人员致力于提高基因组测序速度和降低基因组测序成本,在这两个领域都取得了重大进展,使研究人员更容易研究和分析基因组数据。然而,如何有效地存储和传输高通量测序技术生成的大量基因组数据,已成为数据压缩研究人员面临的挑战。因此,研究基因组数据压缩算法以促进基因组数据的高效表示逐渐引起了这些研究人员的关注。同时,考虑到当前的计算设备具有多个内核,如何充分利用计算设备的优势并提高并行处理的效率,也是设计基因组压缩算法的一个重要方向。

结果

我们提出了一种基于参考基因组序列的算法(LMSRGC),用于压缩 FASTA 格式的基因组数据。该算法使用后缀数组(SA)和最长公共前缀(LCP)数组来查找最长匹配子字符串(LMS),以实现对基因组数据的压缩。该算法利用 SA 和 LCP 数组的特点,选择压缩基因组序列和参考基因组序列之间的所有合适的 LMS,然后利用 LMS 来压缩目标基因组序列。为了加速算法的运行,我们使用 GPU 并行化 SA 的构建,同时使用多个线程并行化 LCP 数组的创建和 LMS 的过滤。

结论

实验结果表明,我们的算法在压缩比和压缩时间方面与当前最先进的算法具有竞争力。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/18eb4702768b/12859_2023_5500_Fig1_HTML.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验