• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

一种应用于压缩基因组重测序数据的“最长公共子序列问题”新算法。

A new algorithm for "the LCS problem" with application in compressing genome resequencing data.

作者信息

Beal Richard, Afrin Tazin, Farheen Aliya, Adjeroh Donald

机构信息

Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA.

出版信息

BMC Genomics. 2016 Aug 18;17 Suppl 4(Suppl 4):544. doi: 10.1186/s12864-016-2793-0.

DOI:10.1186/s12864-016-2793-0
PMID:27556803
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC5001248/
Abstract

BACKGROUND

The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data.

METHODS

First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. Using the maximal common substrings, we construct a directed acyclic graph (DAG), based on which we determine the LCS as the longest path in the DAG. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself.

RESULTS

Our basic scheme compressed the Homo sapiens genome (with an original size of 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, or an overall compression ratio of 360. This can be compared to the previous state-of-the-art compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011).

CONCLUSION

We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms demonstrate the performance of the proposed method.

摘要

背景

最长公共子序列(LCS)问题是计算机科学中的一个经典问题,也是当前用于基因组重测序数据的最佳参考压缩方案的基础。

方法

首先,我们提出了一种解决LCS问题的新算法。利用广义后缀树,我们识别两个输入序列之间共享的公共子串。利用最大公共子串,我们构建一个有向无环图(DAG),并基于此将LCS确定为DAG中的最长路径。然后,我们引入一种基于LCS的参考压缩方案,该方案使用LCS的组成部分,而不是LCS本身。

结果

我们的基本方案将人类基因组(原始大小为3,080,436,051字节)压缩到15,460,478字节。对基本方法的改进进一步将其减少到8,556,708字节,总体压缩率为360。相比之下,之前的最先进压缩率分别为157(Wang和Zhang,2011)和171(Pinho、Pratas和Garcia,2011)。

结论

我们提出了一种新算法来解决最长公共子序列问题。受我们的LCS算法启发,我们引入了一种用于基因组重测序数据的新的基于参考的压缩方案。与最先进的基于参考的压缩算法的比较结果证明了所提方法的性能。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e9a/5001248/d063e0cd5003/12864_2016_2793_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e9a/5001248/2c244ff0542b/12864_2016_2793_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e9a/5001248/d1b55a25172f/12864_2016_2793_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e9a/5001248/af5406b970e8/12864_2016_2793_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e9a/5001248/ac0873b39607/12864_2016_2793_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e9a/5001248/d063e0cd5003/12864_2016_2793_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e9a/5001248/2c244ff0542b/12864_2016_2793_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e9a/5001248/d1b55a25172f/12864_2016_2793_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e9a/5001248/af5406b970e8/12864_2016_2793_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e9a/5001248/ac0873b39607/12864_2016_2793_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0e9a/5001248/d063e0cd5003/12864_2016_2793_Fig5_HTML.jpg

相似文献

1
A new algorithm for "the LCS problem" with application in compressing genome resequencing data.一种应用于压缩基因组重测序数据的“最长公共子序列问题”新算法。
BMC Genomics. 2016 Aug 18;17 Suppl 4(Suppl 4):544. doi: 10.1186/s12864-016-2793-0.
2
An OpenMP-based tool for finding longest common subsequence in bioinformatics.一种基于OpenMP的生物信息学中查找最长公共子序列的工具。
BMC Res Notes. 2019 Apr 11;12(1):220. doi: 10.1186/s13104-019-4256-6.
3
Reference-based genome compression using the longest matched substrings with parallelization consideration.基于参考的最长匹配子串基因组压缩及其并行化考虑。
BMC Bioinformatics. 2023 Sep 30;24(1):369. doi: 10.1186/s12859-023-05500-z.
4
Efficient Computation of Longest Common Subsequences with Multiple Substring Inclusive Constraints.
J Comput Biol. 2019 Sep;26(9):938-947. doi: 10.1089/cmb.2019.0008. Epub 2019 Apr 8.
5
Deposition and extension approach to find longest common subsequence for thousands of long sequences.沉积和扩展方法来寻找数千个长序列的最长公共子序列。
Comput Biol Chem. 2010 Jun;34(3):149-57. doi: 10.1016/j.compbiolchem.2010.05.001. Epub 2010 May 11.
6
A fast and memory efficient MLCS algorithm by character merging for DNA sequences alignment.一种基于字符合并的快速且节省内存的 DNA 序列比对 MLCS 算法。
Bioinformatics. 2020 Feb 15;36(4):1066-1073. doi: 10.1093/bioinformatics/btz725.
7
Algorithmic height compression of unordered trees.无序树的算法高度压缩
J Theor Biol. 2016 Jan 21;389:237-52. doi: 10.1016/j.jtbi.2015.10.030. Epub 2015 Nov 6.
8
Multiple genome alignment based on longest path in directed acyclic graphs.基于有向无环图中最长路径的多基因组比对。
Int J Bioinform Res Appl. 2010;6(4):366-83. doi: 10.1504/IJBRA.2010.036.
9
Exemplar longest common subsequence.示例最长公共子序列。
IEEE/ACM Trans Comput Biol Bioinform. 2007 Oct-Dec;4(4):535-43. doi: 10.1109/TCBB.2007.1066.
10
Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform.利用压缩后缀树和布罗伊登-弗莱彻-戈德法布-香农(BFGS)变换进行图形化泛基因组分析。 (注:原文中未提及“布罗伊登-弗莱彻-戈德法布-香农(BFGS)变换”,这里按照常规理解将Burrows-Wheeler transform翻译为布罗伊登-弗莱彻-戈德法布-香农变换,你可根据实际情况进行调整,因为它可能是特定领域有固定译法的专业术语,也有可能是输入有误,如果实际是Burrows-Wheeler transform应翻译为“伯罗斯-惠勒变换” ) 修正后的译文:利用压缩后缀树和伯罗斯-惠勒变换进行图形化泛基因组分析。
Bioinformatics. 2016 Feb 15;32(4):497-504. doi: 10.1093/bioinformatics/btv603. Epub 2015 Oct 26.

引用本文的文献

1
High-performance computing for SARS-CoV-2 RNAs clustering: a data science‒based genomics approach.用于SARS-CoV-2 RNA聚类的高性能计算:一种基于数据科学的基因组学方法。
Genomics Inform. 2021 Dec;19(4):e49. doi: 10.5808/gi.21056. Epub 2021 Dec 31.
2
Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review.用于组装基因组的垂直无损基因组数据压缩工具:系统文献回顾。
PLoS One. 2020 May 26;15(5):e0232942. doi: 10.1371/journal.pone.0232942. eCollection 2020.
3
Efficient algorithms for Longest Common Subsequence of two bucket orders to speed up pairwise genetic map comparison.

本文引用的文献

1
A Space-Bounded Anytime Algorithm for the Multiple Longest Common Subsequence Problem.一种针对多重最长公共子序列问题的空间受限随时算法。
IEEE Trans Knowl Data Eng. 2014 Nov;26(11):2599-2609. doi: 10.1109/TKDE.2014.2304464.
2
FRESCO: Referential compression of highly similar sequences.FRESCO:高度相似序列的参考压缩
IEEE/ACM Trans Comput Biol Bioinform. 2013 Sep-Oct;10(5):1275-88. doi: 10.1109/tcbb.2013.122.
3
SCALCE: boosting sequence compression algorithms using locally consistent encoding.SCALCE:使用局部一致编码提升序列压缩算法。
用于加速两两遗传图谱比较的两个桶序最长公共子序列的高效算法。
PLoS One. 2018 Dec 27;13(12):e0208838. doi: 10.1371/journal.pone.0208838. eCollection 2018.
4
K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics.K2 和 K2*:基于 Kendall 统计量的高效无对齐序列相似性度量。
Bioinformatics. 2018 May 15;34(10):1682-1689. doi: 10.1093/bioinformatics/btx809.
Bioinformatics. 2012 Dec 1;28(23):3051-7. doi: 10.1093/bioinformatics/bts593. Epub 2012 Oct 9.
4
Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform.利用布劳尔-惠勒变换对基因组序列数据库进行大规模压缩。
Bioinformatics. 2012 Jun 1;28(11):1415-9. doi: 10.1093/bioinformatics/bts173. Epub 2012 May 3.
5
GReEn: a tool for efficient compression of genome resequencing data.GReEn:一种用于高效压缩基因组重测序数据的工具。
Nucleic Acids Res. 2012 Feb;40(4):e27. doi: 10.1093/nar/gkr1124. Epub 2011 Dec 1.
6
A novel compression tool for efficient storage of genome resequencing data.一种用于高效存储基因组重测序数据的新型压缩工具。
Nucleic Acids Res. 2011 Apr;39(7):e45. doi: 10.1093/nar/gkr009. Epub 2011 Jan 25.
7
Efficient storage of high throughput DNA sequencing data using reference-based compression.利用基于参考的压缩技术高效存储高通量 DNA 测序数据。
Genome Res. 2011 May;21(5):734-40. doi: 10.1101/gr.114819.110. Epub 2011 Jan 18.
8
Textual data compression in computational biology: a synopsis.计算生物学中的文本数据压缩:概述。
Bioinformatics. 2009 Jul 1;25(13):1575-86. doi: 10.1093/bioinformatics/btp117. Epub 2009 Feb 27.
9
Computational comparison of two draft sequences of the human genome.人类基因组两个草图序列的计算比较。
Nature. 2001 Feb 15;409(6822):856-9. doi: 10.1038/35057055.
10
Identification of common molecular subsequences.常见分子子序列的鉴定
J Mol Biol. 1981 Mar 25;147(1):195-7. doi: 10.1016/0022-2836(81)90087-5.