Clustering-Based Compression for Population DNA Sequences.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2019 Jan-Feb;16(1):208-221. doi: 10.1109/TCBB.2017.2762302. Epub 2017 Oct 12.

DOI: 10.1109/TCBB.2017.2762302
PMID: 29028207
Abstract

Due to the advancement of DNA sequencing techniques, the number of sequenced individual genomes has experienced an exponential growth. Thus, effective compression of this kind of sequences is highly desired. In this work, we present a novel compression algorithm called Reference-based Compression algorithm using the concept of Clustering (RCC). The rationale behind RCC is based on the observation about the existence of substructures within the population sequences. To utilize these substructures, k-means clustering is employed to partition sequences into clusters for better compression. A reference sequence is then constructed for each cluster so that sequences in that cluster can be compressed by referring to this reference sequence. The reference sequence of each cluster is also compressed with reference to a sequence which is derived from all the reference sequences. Experiments show that RCC can further reduce the compressed size by up to 91.0 percent when compared with state-of-the-art compression approaches. There is a compromise between compressed size and processing time. The current implementation in Matlab has time complexity in a factor of thousands higher than the existing algorithms implemented in C/C++. Further investigation is required to improve processing time in future.
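The pipeline described in the abstract (cluster the sequences, build one reference per cluster, encode each sequence against its cluster reference, then encode the cluster references against a sequence derived from all of them) can be illustrated with a toy sketch. This is not the authors' RCC implementation. It assumes equal-length, pre-aligned sequences, swaps in a k-means-style loop that uses Hamming distance and majority-vote consensus as the centroid, and stores plain substitution lists instead of a real entropy-coded format. All function names are hypothetical.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def consensus(seqs):
    """Majority base at each position (stand-in for a cluster centroid)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def nearest(seq, refs):
    """Index of the reference closest to seq in Hamming distance."""
    return min(range(len(refs)), key=lambda i: hamming(seq, refs[i]))


def cluster(seqs, k, iters=10, seed=0):
    """k-means-style loop: assign to nearest reference, recompute consensus."""
    rng = random.Random(seed)
    refs = rng.sample(seqs, k)  # assumption: random initial references
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for s in seqs:
            groups[nearest(s, refs)].append(s)
        # Keep the old reference if a cluster ends up empty.
        new_refs = [consensus(g) if g else refs[i] for i, g in enumerate(groups)]
        if new_refs == refs:
            break
        refs = new_refs
    return refs


def encode(seq, ref):
    """Store only the (position, base) pairs where seq differs from ref."""
    return [(i, b) for i, (b, r) in enumerate(zip(seq, ref)) if b != r]


def decode(diffs, ref):
    """Rebuild a sequence by applying stored substitutions to ref."""
    out = list(ref)
    for i, b in diffs:
        out[i] = b
    return "".join(out)


def compress(seqs, k):
    refs = cluster(seqs, k)
    root = consensus(refs)  # sequence derived from all cluster references
    ref_diffs = [encode(r, root) for r in refs]
    items = []
    for s in seqs:
        j = nearest(s, refs)
        items.append((j, encode(s, refs[j])))
    return root, ref_diffs, items


def decompress(root, ref_diffs, items):
    refs = [decode(d, root) for d in ref_diffs]
    return [decode(d, refs[j]) for j, d in items]


seqs = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "TTGTACCC", "TTGTACCA", "TTGTACCC"]
root, ref_diffs, items = compress(seqs, k=2)
assert decompress(root, ref_diffs, items) == seqs  # lossless round trip
```

The round trip is lossless whatever clustering the loop converges to; the quality of the clustering only affects how many substitutions have to be stored per sequence.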

Similar articles

1
Clustering-Based Compression for Population DNA Sequences.
IEEE/ACM Trans Comput Biol Bioinform. 2019 Jan-Feb;16(1):208-221. doi: 10.1109/TCBB.2017.2762302. Epub 2017 Oct 12.
2
Sketch distance-based clustering of chromosomes for large genome database compression.
BMC Genomics. 2019 Dec 30;20(Suppl 10):978. doi: 10.1186/s12864-019-6310-0.
3
Compression of Multiple DNA Sequences Using Intra-Sequence and Inter-Sequence Similarities.
IEEE/ACM Trans Comput Biol Bioinform. 2015 Nov-Dec;12(6):1322-32. doi: 10.1109/TCBB.2015.2403370.
4
On-Demand Indexing for Referential Compression of DNA Sequences.
PLoS One. 2015 Jul 6;10(7):e0132460. doi: 10.1371/journal.pone.0132460. eCollection 2015.
5
CoGI: Towards Compressing Genomes as an Image.
IEEE/ACM Trans Comput Biol Bioinform. 2015 Nov-Dec;12(6):1275-85. doi: 10.1109/TCBB.2015.2430331.
6
FRESCO: Referential compression of highly similar sequences.
IEEE/ACM Trans Comput Biol Bioinform. 2013 Sep-Oct;10(5):1275-88. doi: 10.1109/tcbb.2013.122.
7
ERGC: an efficient referential genome compression algorithm.
Bioinformatics. 2015 Nov 1;31(21):3468-75. doi: 10.1093/bioinformatics/btv399. Epub 2015 Jul 2.
8
Compression of next-generation sequencing quality scores using memetic algorithm.
BMC Bioinformatics. 2014;15 Suppl 15(Suppl 15):S10. doi: 10.1186/1471-2105-15-S15-S10. Epub 2014 Dec 3.
9
Efficient storage of high throughput DNA sequencing data using reference-based compression.
Genome Res. 2011 May;21(5):734-40. doi: 10.1101/gr.114819.110. Epub 2011 Jan 18.
10
WBFQC: A new approach for compressing next-generation sequencing data splitting into homogeneous streams.
J Bioinform Comput Biol. 2018 Oct;16(5):1850018. doi: 10.1142/S021972001850018X. Epub 2018 Jun 28.

Cited by

1
SparkGC: Spark based genome compression for large collections of genomes.
BMC Bioinformatics. 2022 Jul 25;23(1):297. doi: 10.1186/s12859-022-04825-5.
2
Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review.
PLoS One. 2020 May 26;15(5):e0232942. doi: 10.1371/journal.pone.0232942. eCollection 2020.
3
Sketch distance-based clustering of chromosomes for large genome database compression.
BMC Genomics. 2019 Dec 30;20(Suppl 10):978. doi: 10.1186/s12864-019-6310-0.