iDoComp: a compression scheme for assembled genomes.

Author information

Ochoa Idoia, Hernaez Mikel, Weissman Tsachy

Affiliation

Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA.

Publication information

Bioinformatics. 2015 Mar 1;31(5):626-33. doi: 10.1093/bioinformatics/btu698. Epub 2014 Oct 24.

Abstract

MOTIVATION

With the release of the latest next-generation sequencing (NGS) machine, the Illumina HiSeq X, the cost of sequencing a human genome has dropped to a mere $4000. We are thus approaching a milestone in sequencing history, the $1000 genome era, in which sequencing individuals is affordable, opening the door to effective personalized medicine. Massive generation of genomic data, including assembled genomes, is expected in the coming years. There is a crucial need for genome compression that is guaranteed to perform well across different species, from simple bacteria to humans, which will ease the transmission, dissemination and analysis of genomes. Further, most of the new genomes to be compressed will correspond to individuals of a species for which a reference already exists in the database. Thus, it is natural to propose compression schemes that assume and exploit the availability of such references.

RESULTS

We propose iDoComp, a compressor for assembled genomes in FASTA format that compresses an individual genome using a reference genome for both compression and decompression. In terms of compression efficiency, iDoComp outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time. For example, we observe compression gains of up to 60% in several cases, including H. sapiens data, compared with the best compression performance among the previously proposed algorithms.
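
The abstract does not describe iDoComp's internals, so the following is only a toy sketch of the general idea of reference-based compression, not the authors' algorithm. It assumes a simple greedy parse of the target into (reference position, match length, next literal) triples, with matches found through a k-mer index; the names `build_index`, `compress`, `decompress` and the parameter `k` are hypothetical and chosen for illustration. The same reference is needed to decompress, as in the scheme described above.

```python
from typing import Dict, List, Tuple

# (start in reference, match length, literal character that follows the match)
Triple = Tuple[int, int, str]


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it occurs."""
    index: Dict[str, List[int]] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(target: str, reference: str, k: int = 4) -> List[Triple]:
    """Greedily parse the target into matches against the reference."""
    index = build_index(reference, k)
    triples: List[Triple] = []
    i = 0
    while i < len(target):
        best_pos, best_len = 0, 0
        # Try every reference position sharing the next k-mer; keep the longest match.
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        i += best_len
        # The character after the match is stored verbatim ("" at end of target).
        literal = target[i] if i < len(target) else ""
        triples.append((best_pos, best_len, literal))
        i += 1
    return triples


def decompress(triples: List[Triple], reference: str) -> str:
    """Rebuild the target from the triples and the same reference."""
    parts: List[str] = []
    for pos, length, literal in triples:
        parts.append(reference[pos:pos + length])
        parts.append(literal)
    return "".join(parts)


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCAGGTACCA"
    target = "ACGTACGATTGACCAGGTACCA"  # one substitution relative to the reference
    triples = compress(target, reference)
    assert decompress(triples, reference) == target
    print(triples)
```

In this sketch, a target that differs from the reference by a few substitutions collapses to a handful of triples, which is why referential schemes can be far more compact than general-purpose compressors on genomes of the same species. A practical compressor would additionally entropy-code the triples and handle FASTA headers and line lengths, which this example omits.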

AVAILABILITY

iDoComp is written in C and can be downloaded from http://www.stanford.edu/~iochoa/iDoComp.html. We also provide a full explanation of how to run the program, along with an example that includes all the necessary files.
