iDoComp: a compression scheme for assembled genomes.

Authors

Ochoa Idoia, Hernaez Mikel, Weissman Tsachy

Affiliation

Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA.

Publication information

Bioinformatics. 2015 Mar 1;31(5):626-33. doi: 10.1093/bioinformatics/btu698. Epub 2014 Oct 24.

DOI:10.1093/bioinformatics/btu698

PMID:25344501

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5855886/

Abstract

MOTIVATION

With the release of the latest next-generation sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing a human genome has dropped to a mere $4000. We are thus approaching a milestone in the history of sequencing, the $1000 genome era, in which sequencing individuals is affordable, opening the door to effective personalized medicine. Massive generation of genomic data, including assembled genomes, is expected in the coming years. There is a crucial need for genome compression that is guaranteed to perform well across different species, from simple bacteria to humans, which will ease the transmission, dissemination and analysis of genomes. Further, most of the new genomes to be compressed will correspond to individuals of a species for which a reference already exists in the database. It is therefore natural to propose compression schemes that assume and exploit the availability of such references.

RESULTS

We propose iDoComp, a compressor for assembled genomes in FASTA format that compresses an individual genome using a reference genome for both compression and decompression. In terms of compression efficiency, iDoComp outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time. For example, we observe compression gains of up to 60% in several cases, including H. sapiens data, relative to the best compression performance among the previously proposed algorithms.
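
To illustrate the general idea of compressing a genome against a reference that is available to both the compressor and the decompressor, the following is a minimal sketch in Python. It is not iDoComp's actual algorithm, which the abstract does not describe. It is a generic greedy scheme that encodes the target as copies from the reference plus literal bases. The function names (`build_index`, `compress`, `decompress`), the seed length `k`, and the token layout are all assumptions made for this example.

```python
def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(target, reference, k=4):
    """Greedily parse target into ("match", pos, length) and ("literal", text) tokens.

    Hypothetical toy encoder: a match copies `length` bases starting at
    reference[pos]; anything that cannot be matched is stored verbatim.
    """
    index = build_index(reference, k)
    tokens, literal, i = [], [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position that shares the next k-mer and extend it.
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (i + length < len(target) and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:  # flush pending unmatched bases first
                tokens.append(("literal", "".join(literal)))
                literal = []
            tokens.append(("match", best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])
            i += 1
    if literal:
        tokens.append(("literal", "".join(literal)))
    return tokens


def decompress(tokens, reference):
    """Rebuild the target from the tokens; requires the same reference."""
    parts = []
    for token in tokens:
        if token[0] == "match":
            _, pos, length = token
            parts.append(reference[pos:pos + length])
        else:
            parts.append(token[1])
    return "".join(parts)


reference = "ACGTACGTTTGACCA"
target = "ACGTACGTATGACCA"  # differs from the reference by one substitution
tokens = compress(target, reference)
restored = decompress(tokens, reference)
```

In this toy example the single substitution splits the target into two copies from the reference separated by one literal base, and `decompress` recovers the target exactly. A practical compressor would additionally entropy-code the tokens and handle FASTA headers and line lengths, which this sketch omits.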

AVAILABILITY

iDoComp is written in C and can be downloaded from http://www.stanford.edu/~iochoa/iDoComp.html (we also provide a full explanation of how to run the program and an example with all the necessary files to run it).
