High efficiency referential genome compression algorithm.

Affiliations

School of Information, Yunnan University, Kunming, China.

Information Security College, Yunnan Police College, Kunming, China.

Publication information

Bioinformatics. 2019 Jun 1;35(12):2058-2065. doi: 10.1093/bioinformatics/bty934.

Abstract

MOTIVATION

With the development and growing adoption of next-generation sequencing (NGS) technologies, genome sequencing has become faster and cheaper, producing a massive amount of genome sequence data that continues to grow at an explosive rate. The time and cost of transmitting, storing, processing and analyzing these genetic data have become bottlenecks that hinder the development of genetics and biomedicine. Although many general-purpose data compression algorithms exist, they are not effective for genome sequences because they cannot exploit the inherent characteristics of genome sequence data. Developing a fast and efficient compression algorithm specific to genome data is therefore an important and pressing issue.

RESULTS

We have developed a referential lossless genome data compression algorithm that performs better than previous algorithms. A carefully designed matching-strategy selection mechanism combines the advantages of local matching and global matching to describe matched sub-strings more efficiently. The effects of both the length and the position of matched sub-strings on compression efficiency are considered jointly. The proposed algorithm can compress the FASTA data of a complete human genome, about 3 GB each, in about 18 min. The compressed file sizes range from a few megabytes to about forty megabytes. The average compression ratio is higher than that of state-of-the-art genome compression algorithms, and the time complexity is of the same order as that of the best-known algorithms.
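
The general idea of referential compression with a local-then-global matching preference can be illustrated with a toy sketch. This is not the SCCG implementation: the k-mer length `K`, the `window` parameter, the token format and all function names below are hypothetical choices made for illustration, and the abstract does not specify how the published algorithm selects between strategies. The sketch indexes the k-mers of the reference, encodes the target as matches `("M", ref_pos, length)` and literals `("L", text)`, and prefers match candidates close to the end of the previous match before falling back to candidates anywhere in the reference.

```python
from collections import defaultdict

K = 4  # hypothetical k-mer length for this toy example


def build_index(reference, k=K):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def extend(reference, target, r, t):
    """Length of the common run of reference[r:] and target[t:]."""
    n = 0
    while (r + n < len(reference) and t + n < len(target)
           and reference[r + n] == target[t + n]):
        n += 1
    return n


def compress(reference, target, k=K, window=16):
    """Greedy encoding into ('M', ref_pos, length) and ('L', text) tokens.

    Candidates within `window` of the previous match end are tried first
    (a stand-in for local matching); if there are none, every indexed
    position is tried (a stand-in for global matching).
    """
    index = build_index(reference, k)
    tokens, literal = [], []
    t, prev_end = 0, 0
    while t < len(target):
        candidates = index.get(target[t:t + k], [])
        local = [r for r in candidates if abs(r - prev_end) <= window]
        best_len, best_pos = 0, -1
        for r in (local or candidates):
            n = extend(reference, target, r, t)
            if n > best_len:
                best_len, best_pos = n, r
        if best_len >= k:
            if literal:  # flush pending unmatched characters
                tokens.append(("L", "".join(literal)))
                literal = []
            tokens.append(("M", best_pos, best_len))
            t += best_len
            prev_end = best_pos + best_len
        else:
            literal.append(target[t])
            t += 1
    if literal:
        tokens.append(("L", "".join(literal)))
    return tokens


def decompress(reference, tokens):
    """Rebuild the target from the reference and the token list."""
    out = []
    for tok in tokens:
        if tok[0] == "M":
            _, pos, length = tok
            out.append(reference[pos:pos + length])
        else:
            out.append(tok[1])
    return "".join(out)
```

Because matches near the previous one are tried first, their positions tend to be close to where the decoder already is, which is what makes them cheap to describe in a real encoder (for example as small offsets). A production tool would additionally entropy-code the token stream and handle FASTA headers, line lengths, case and `N` runs, none of which this sketch attempts.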

AVAILABILITY AND IMPLEMENTATION

https://github.com/jhchen5/SCCG.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
