Ochoa Idoia, Hernaez Mikel, Weissman Tsachy
Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA.
Bioinformatics. 2015 Mar 1;31(5):626-33. doi: 10.1093/bioinformatics/btu698. Epub 2014 Oct 24.
With the release of the latest next-generation sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing a human genome has dropped to a mere $4000. We are thus approaching a milestone in sequencing history, known as the $1000 genome era, in which sequencing individuals is affordable, opening the door to effective personalized medicine. Massive generation of genomic data, including assembled genomes, is expected in the coming years. There is a crucial need for genome compression that is guaranteed to perform well across different species, from simple bacteria to humans, which will ease the transmission, dissemination and analysis of genomes. Further, most of the new genomes to be compressed will correspond to individuals of a species for which a reference already exists in the database. It is therefore natural to propose compression schemes that assume and exploit the availability of such references.
We propose iDoComp, a compressor of assembled genomes in FASTA format that compresses an individual genome using a reference genome for both compression and decompression. In terms of compression efficiency, iDoComp outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time. For example, we observe compression gains of up to 60% in several cases, including H. sapiens data, relative to the best compression performance among the previously proposed algorithms.
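To illustrate the general idea of reference-based compression, the following is a minimal sketch in which the target genome is parsed greedily into matches against the reference, each followed by one literal base. It is not iDoComp's actual algorithm, which is not detailed here; the function names `encode` and `decode`, the `min_match` parameter and the triple format are all assumptions made for illustration.

```python
def encode(reference: str, target: str, min_match: int = 4):
    """Greedily parse target into (ref_pos, length, literal) triples.

    Illustrative only: a generic reference-based scheme, not iDoComp itself.
    """
    # Index every substring of length min_match in the reference
    # by the positions at which it starts.
    index = {}
    for i in range(len(reference) - min_match + 1):
        index.setdefault(reference[i:i + min_match], []).append(i)

    ops = []
    t = 0
    while t < len(target):
        best_pos, best_len = 0, 0
        # Try every reference position sharing the next min_match bases
        # and keep the longest extension.
        for r in index.get(target[t:t + min_match], []):
            length = 0
            while (r + length < len(reference)
                   and t + length < len(target)
                   and reference[r + length] == target[t + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = r, length
        t += best_len
        # The base following the match is stored explicitly
        # (empty string if the target ended with the match).
        literal = target[t] if t < len(target) else ""
        ops.append((best_pos, best_len, literal))
        t += len(literal)
    return ops


def decode(reference: str, ops):
    """Rebuild the target from the reference and the triples."""
    return "".join(reference[p:p + n] + c for p, n, c in ops)
```

In this sketch the decoder needs the same reference that the encoder used, which mirrors the requirement that the reference be available for both compression and decompression. A target that differs from the reference by a single substitution, for instance, is encoded as just two triples, which could then be passed to an entropy coder.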
iDoComp is written in C and can be downloaded from http://www.stanford.edu/~iochoa/iDoComp.html. We also provide a full explanation of how to run the program and an example with all the necessary files to run it.