• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

基于参考的最长匹配子串基因组压缩及其并行化考虑。

Reference-based genome compression using the longest matched substrings with parallelization consideration.

机构信息

School of Information, Yunnan University, KunMing, China.

Yunnan Physical Science and Sports Professional College, KunMing, China.

出版信息

BMC Bioinformatics. 2023 Sep 30;24(1):369. doi: 10.1186/s12859-023-05500-z.

DOI:10.1186/s12859-023-05500-z
PMID:37777730
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC10544193/
Abstract

BACKGROUND

A large number of researchers have devoted to accelerating the speed of genome sequencing and reducing the cost of genome sequencing for decades, and they have made great strides in both areas, making it easier for researchers to study and analyze genome data. However, how to efficiently store and transmit the vast amount of genome data generated by high-throughput sequencing technologies has become a challenge for data compression researchers. Therefore, the research of genome data compression algorithms to facilitate the efficient representation of genome data has gradually attracted the attention of these researchers. Meanwhile, considering that the current computing devices have multiple cores, how to make full use of the advantages of the computing devices and improve the efficiency of parallel processing is also an important direction for designing genome compression algorithms.

RESULTS

We proposed an algorithm (LMSRGC) based on reference genome sequences, which uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMS) for the compression of genome data in FASTA format. The proposed algorithm utilizes the characteristics of SA and the LCP array to select all appropriate LMSs between the genome sequence to be compressed and the reference genome sequence and then utilizes LMSs to compress the target genome sequence. To speed up the operation of the algorithm, we use GPUs to parallelize the construction of SA, while using multiple threads to parallelize the creation of the LCP array and the filtering of LMSs.

CONCLUSIONS

Experiment results demonstrate that our algorithm is competitive with the current state-of-the-art algorithms in compression ratio and compression time.

摘要

背景

数十年来,大量研究人员致力于提高基因组测序速度和降低基因组测序成本,在这两个领域都取得了重大进展,使研究人员更容易研究和分析基因组数据。然而,如何有效地存储和传输高通量测序技术生成的大量基因组数据,已成为数据压缩研究人员面临的挑战。因此,研究基因组数据压缩算法以促进基因组数据的高效表示逐渐引起了这些研究人员的关注。同时,考虑到当前的计算设备具有多个内核,如何充分利用计算设备的优势并提高并行处理的效率,也是设计基因组压缩算法的一个重要方向。

结果

我们提出了一种基于参考基因组序列的算法(LMSRGC),用于压缩 FASTA 格式的基因组数据。该算法使用后缀数组(SA)和最长公共前缀(LCP)数组来查找最长匹配子字符串(LMS),以实现对基因组数据的压缩。该算法利用 SA 和 LCP 数组的特点,选择压缩基因组序列和参考基因组序列之间的所有合适的 LMS,然后利用 LMS 来压缩目标基因组序列。为了加速算法的运行,我们使用 GPU 并行化 SA 的构建,同时使用多个线程并行化 LCP 数组的创建和 LMS 的过滤。

结论

实验结果表明,我们的算法在压缩比和压缩时间方面与当前最先进的算法具有竞争力。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/eb3f1af7d7de/12859_2023_5500_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/18eb4702768b/12859_2023_5500_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/d1f98a9c6520/12859_2023_5500_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/6583b3e49909/12859_2023_5500_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/e001001a1bc6/12859_2023_5500_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/eb3f1af7d7de/12859_2023_5500_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/18eb4702768b/12859_2023_5500_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/d1f98a9c6520/12859_2023_5500_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/6583b3e49909/12859_2023_5500_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/e001001a1bc6/12859_2023_5500_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60ab/10544193/eb3f1af7d7de/12859_2023_5500_Fig5_HTML.jpg

相似文献

1
Reference-based genome compression using the longest matched substrings with parallelization consideration.基于参考的最长匹配子串基因组压缩及其并行化考虑。
BMC Bioinformatics. 2023 Sep 30;24(1):369. doi: 10.1186/s12859-023-05500-z.
2
High efficiency referential genome compression algorithm.高效引用基因组压缩算法。
Bioinformatics. 2019 Jun 1;35(12):2058-2065. doi: 10.1093/bioinformatics/bty934.
3
ERGC: an efficient referential genome compression algorithm.ERGC:一种高效的参考基因组压缩算法。
Bioinformatics. 2015 Nov 1;31(21):3468-75. doi: 10.1093/bioinformatics/btv399. Epub 2015 Jul 2.
4
Algorithms designed for compressed-gene-data transformation among gene banks with different references.用于在具有不同参照的基因库之间进行压缩基因数据转换的算法。
BMC Bioinformatics. 2018 Jun 18;19(1):230. doi: 10.1186/s12859-018-2230-2.
5
Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression.通过 (w, k)-最小化子索引后缀-前缀重叠来生成用于读取压缩的长连续体。
Bioinformatics. 2019 Jun 1;35(12):2066-2074. doi: 10.1093/bioinformatics/bty936.
6
Modified HuffBit Compress Algorithm - An Application of R.改进的哈夫比特压缩算法 - R的一种应用
J Integr Bioinform. 2018 Feb 22;15(3):20170057. doi: 10.1515/jib-2017-0057.
7
SparkGC: Spark based genome compression for large collections of genomes.SparkGC:基于 Spark 的基因组压缩方法,适用于大规模基因组集合。
BMC Bioinformatics. 2022 Jul 25;23(1):297. doi: 10.1186/s12859-022-04825-5.
8
A new algorithm for "the LCS problem" with application in compressing genome resequencing data.一种应用于压缩基因组重测序数据的“最长公共子序列问题”新算法。
BMC Genomics. 2016 Aug 18;17 Suppl 4(Suppl 4):544. doi: 10.1186/s12864-016-2793-0.
9
A randomized optimal k-mer indexing approach for efficient parallel genome sequence compression.一种用于高效并行基因组序列压缩的随机最优 k-mer 索引方法。
Gene. 2024 May 20;907:148235. doi: 10.1016/j.gene.2024.148235. Epub 2024 Feb 10.
10
High-speed and high-ratio referential genome compression.高速高比参照基因组压缩。
Bioinformatics. 2017 Nov 1;33(21):3364-3372. doi: 10.1093/bioinformatics/btx412.

本文引用的文献

1
Allowing mutations in maximal matches boosts genome compression performance.允许最大匹配中的突变可以提高基因组压缩性能。
Bioinformatics. 2020 Sep 15;36(18):4675-4681. doi: 10.1093/bioinformatics/btaa572.
2
Fast detection of maximal exact matches via fixed sampling of query K-mers and Bloom filtering of index K-mers.通过查询 K -mer 的固定采样和索引 K-mer 的布隆过滤实现最大精确匹配的快速检测。
Bioinformatics. 2019 Nov 1;35(22):4560-4567. doi: 10.1093/bioinformatics/btz273.
3
High-speed and high-ratio referential genome compression.
高速高比参照基因组压缩。
Bioinformatics. 2017 Nov 1;33(21):3364-3372. doi: 10.1093/bioinformatics/btx412.
4
Comment on: 'ERGC: an efficient referential genome compression algorithm'.关于《ERGC:一种高效的参考基因组压缩算法》的评论
Bioinformatics. 2016 Apr 1;32(7):1115-7. doi: 10.1093/bioinformatics/btv704. Epub 2015 Nov 28.
5
ERGC: an efficient referential genome compression algorithm.ERGC:一种高效的参考基因组压缩算法。
Bioinformatics. 2015 Nov 1;31(21):3468-75. doi: 10.1093/bioinformatics/btv399. Epub 2015 Jul 2.
6
GDC 2: Compression of large collections of genomes.基因组数据压缩2:大型基因组集合的压缩
Sci Rep. 2015 Jun 25;5:11565. doi: 10.1038/srep11565.
7
essaMEM: finding maximal exact matches using enhanced sparse suffix arrays.essaMEM:使用增强型稀疏后缀数组查找最大精确匹配。
Bioinformatics. 2013 Mar 15;29(6):802-4. doi: 10.1093/bioinformatics/btt042. Epub 2013 Jan 24.
8
A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays.一种使用稀疏后缀数组在大型序列数据集中查找最大精确匹配的实用算法。
Bioinformatics. 2009 Jul 1;25(13):1609-16. doi: 10.1093/bioinformatics/btp275. Epub 2009 Apr 23.