High efficiency referential genome compression algorithm.

Affiliations

School of Information, Yunnan University, Kunming, China.

Information Security College, Yunnan Police College, Kunming, China.

Publication information

Bioinformatics. 2019 Jun 1;35(12):2058-2065. doi: 10.1093/bioinformatics/bty934.

DOI:10.1093/bioinformatics/bty934

PMID:30407493

Abstract

MOTIVATION

With the development and increasingly widespread application of next-generation sequencing (NGS) technologies, genome sequencing has become faster and cheaper, creating a massive amount of genome sequence data that is still growing at an explosive rate. The time and cost of transmitting, storing, processing and analyzing these genetic data have become bottlenecks that hinder the development of genetics and biomedicine. Although there are many general-purpose data compression algorithms, they are not effective for genome sequences because they cannot exploit the inherent characteristics of genome sequence data. Therefore, developing a fast and efficient compression algorithm specific to genome data is an important and pressing issue.

RESULTS

We have developed a referential lossless genome data compression algorithm that performs better than previous algorithms. A carefully designed mechanism for selecting the matching strategy combines the advantages of local matching and global matching to improve the description efficiency of the matched sub-strings. The effects of both the length and the position of matched sub-strings on compression efficiency are considered jointly. The proposed algorithm can compress the FASTA data of complete human genomes, each of which is about 3 GB, in about 18 min. The compressed file sizes range from a few megabytes to about forty megabytes. The average compression ratio is higher than that of state-of-the-art genome compression algorithms, and the time complexity is of the same order as that of the best-known algorithms.
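As a rough illustration of what referential compression means in general, the following Python sketch encodes a target sequence as a list of (position, length) matches against a reference, plus literal bases where no match is found. It is not the SCCG implementation: the function names, the k-mer length `k`, the greedy longest-match rule and the token format are all assumptions made for this example, and it uses a single global k-mer index rather than the local/global strategy selection described above.

```python
from typing import Dict, List, Tuple, Union

Match = Tuple[int, int]    # (start position in reference, match length)
Token = Union[Match, str]  # a reference match or a literal run of bases


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(reference: str, target: str, k: int = 4) -> List[Token]:
    """Greedily encode target as reference matches plus literal bases.

    Hypothetical sketch: the k-mer index and greedy rule are assumptions,
    not the matching strategy of the published algorithm.
    """
    index = build_index(reference, k)
    tokens: List[Token] = []
    literal: List[str] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the next k-mer; keep the longest.
        for pos in index.get(target[i:i + k], []):
            length = k
            while (pos + length < len(reference)
                   and i + length < len(target)
                   and reference[pos + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:  # flush pending unmatched bases first
                tokens.append("".join(literal))
                literal = []
            tokens.append((best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])
            i += 1
    if literal:
        tokens.append("".join(literal))
    return tokens


def decompress(reference: str, tokens: List[Token]) -> str:
    """Rebuild the target from the reference and the token list."""
    parts: List[str] = []
    for token in tokens:
        if isinstance(token, str):
            parts.append(token)
        else:
            pos, length = token
            parts.append(reference[pos:pos + length])
    return "".join(parts)
```

Because every token is either copied from the reference or stored verbatim, `decompress(reference, compress(reference, target))` returns the original target, which is the lossless property the abstract refers to. A real compressor would additionally entropy-code the token stream; that step is omitted here.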

AVAILABILITY AND IMPLEMENTATION

https://github.com/jhchen5/SCCG.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
