Differential direct coding: a compression algorithm for nucleotide sequence data.

Author information

Vey Gregory

Affiliation

Department of Biology, Wilfrid Laurier University, 75 University Avenue West, Waterloo ON, Canada N2L 3C5.

Publication information

Database (Oxford). 2009;2009:bap013. doi: 10.1093/database/bap013. Epub 2009 Sep 14.

Abstract

While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of nucleotide sequence data is still of paramount importance in order to facilitate fast search and retrieval operations through a reduction in disk traffic. This issue becomes even more important in light of the recent increase of very large data sets, such as metagenomes. In this article, I propose the Differential Direct Coding algorithm, a general-purpose nucleotide compression protocol that can differentiate between sequence data and auxiliary data by supporting the inclusion of supplementary symbols that are not members of the set of expected nucleotide bases, thereby offering reconciliation between sequence-specific and general-purpose compression strategies. This algorithm permits a sequence to contain a rich lexicon of auxiliary symbols that can represent wildcards, annotation data and special subsequences, such as functional domains or special repeats. In particular, the representation of special subsequences can be incorporated to provide structure-based coding that increases the overall degree of compression. Moreover, supporting a robust set of symbols removes the requirement of wildcard elimination and restoration phases, resulting in a complexity of O(n) for execution time, making this algorithm suitable for very large data sets. Because this algorithm compresses data on the basis of triplets, it is highly amenable to interpretation as a polypeptide at decompression time. Also, an encoded sequence may be further compressed using other existing algorithms, like gzip, thereby maximizing the final degree of compression. Overall, the Differential Direct Coding algorithm can offer a beneficial impact on disk traffic for database queries and other disk-intensive operations.
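The abstract does not give the byte layout, so the following is only a minimal sketch of the general idea of triplet-based direct coding with auxiliary symbols, not the published code table. It assumes a hypothetical layout in which bytes 0x80 to 0xBF each stand for one of the 64 triplets over A, C, G and T, while any 7-bit ASCII character (a wildcard such as N, a header, a newline) passes through unchanged as an auxiliary symbol. The names `encode`, `decode`, `BASES` and `TRIPLET_BASE` are illustrative. Both functions make a single pass over the input, which is consistent with the O(n) behaviour described above, and no wildcard elimination or restoration step is needed.

```python
import gzip

BASES = "ACGT"
TRIPLET_BASE = 0x80  # assumption: high-bit bytes mark triplet codes


def encode(seq: str) -> bytes:
    """Pack each run of three plain bases into one byte; copy everything else."""
    out = bytearray()
    i = 0
    while i < len(seq):
        tri = seq[i:i + 3]
        if len(tri) == 3 and all(c in BASES for c in tri):
            # Treat the triplet as a 3-digit base-4 number (0..63).
            idx = (BASES.index(tri[0]) * 16
                   + BASES.index(tri[1]) * 4
                   + BASES.index(tri[2]))
            out.append(TRIPLET_BASE + idx)
            i += 3
        else:
            # Auxiliary symbol (or a leftover base): store it as a literal.
            code = ord(seq[i])
            if code >= TRIPLET_BASE:
                raise ValueError("auxiliary symbols must be 7-bit ASCII")
            out.append(code)
            i += 1
    return bytes(out)


def decode(data: bytes) -> str:
    """Invert encode(): expand triplet bytes, copy literal bytes."""
    parts = []
    for b in data:
        if b >= TRIPLET_BASE:
            idx = b - TRIPLET_BASE
            if idx >= 64:
                raise ValueError("byte outside the assumed triplet range")
            parts.append(BASES[idx // 16] + BASES[(idx // 4) % 4] + BASES[idx % 4])
        else:
            parts.append(chr(b))
    return "".join(parts)


if __name__ == "__main__":
    record = ">demo\nACGTTGNNACGACGACG\n"
    packed = encode(record)
    assert decode(packed) == record
    # The encoded bytes can be handed to a general-purpose compressor.
    assert decode(gzip.decompress(gzip.compress(packed))) == record
    print(len(record), len(packed))
```

Because every triplet byte in this sketch is an index into the 64 possible codons, a decoder could in principle look up an amino acid per byte instead of expanding it to three bases, which is one way to read the abstract's remark about interpretation as a polypeptide. Note that in this simplified scheme a wildcard breaks triplet alignment until three plain bases line up again.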

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d66b/2797453/b8e08dec4925/bap013f1.jpg
