Butler Jonathan, MacCallum Iain, Kleber Michael, Shlyakhter Ilya A, Belmonte Matthew K, Lander Eric S, Nusbaum Chad, Jaffe David B
Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02141, USA.
Genome Res. 2008 May;18(5):810-20. doi: 10.1101/gr.7337908. Epub 2008 Mar 13.
New DNA sequencing technologies deliver data at dramatically lower costs but demand new analytical methods to take full advantage of the very short reads that they produce. We provide an initial, theoretical solution to the challenge of de novo assembly from whole-genome shotgun "microreads." For 11 genomes of sizes up to 39 Mb, we generated high-quality assemblies from 80x coverage by paired 30-base simulated reads modeled after real Illumina-Solexa reads. The bacterial genomes of Campylobacter jejuni and Escherichia coli assemble optimally, yielding single perfect contigs, and larger genomes yield assemblies that are highly connected and accurate. Assemblies are presented in a graph form that retains intrinsic ambiguities such as those arising from polymorphism, thereby providing information that has been absent from previous genome assemblies. For both C. jejuni and E. coli, this assembly graph is a single edge encompassing the entire genome. Larger genomes produce more complicated graphs, but the vast majority of the bases in their assemblies are present in long edges that are nearly always perfect. We describe a general method for genome assembly that can be applied to all types of DNA sequence data, not only short read data, but also conventional sequence reads.
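The idea of an assembly graph collapsing to a single edge can be illustrated with a toy k-mer graph. The sketch below is not the authors' method; it is a minimal illustration under simplifying assumptions: error-free, unpaired reads from a linear sequence, a single strand (no reverse complements), and no handling of isolated cycles such as circular chromosomes. The function names `kmer_graph` and `collapse_edges`, the toy sequence, and the choice of k = 5 are all hypothetical.

```python
from collections import defaultdict


def kmer_graph(reads, k):
    """Build successor/predecessor maps between (k-1)-mers seen in the reads."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            succ[left].add(right)
            pred[right].add(left)
    return succ, pred


def collapse_edges(succ, pred):
    """Merge unbranched paths into maximal edges and return their sequences.

    Assumption: isolated cycles (every node has in-degree 1 and out-degree 1)
    are not reported, so this toy version only suits linear sequences.
    """
    nodes = set(succ) | set(pred)
    n_in = {n: len(pred.get(n, ())) for n in nodes}
    n_out = {n: len(succ.get(n, ())) for n in nodes}
    edges = []
    for start in sorted(nodes):
        if n_in[start] == 1 and n_out[start] == 1:
            continue  # interior of an unbranched path, not an edge start
        for nxt in sorted(succ.get(start, ())):
            seq = start + nxt[-1]
            # Extend while the current node has exactly one way in and one way out.
            while n_in[nxt] == 1 and n_out[nxt] == 1:
                nxt = next(iter(succ[nxt]))
                seq += nxt[-1]
            edges.append(seq)
    return edges


# Toy "genome" in which every 4-mer is unique, tiled by overlapping 8-base reads.
genome = "ATGCGTACCTGAAGTCCA"
reads = [genome[i:i + 8] for i in range(len(genome) - 7)]
edges = collapse_edges(*kmer_graph(reads, k=5))
print(edges)  # a single edge spelling the whole toy sequence

# A sequence with a repeated 5-mer (ACGTT) branches, so it yields several edges.
repeat_genome = "ACGTTGCAACGTTAGG"
repeat_reads = [repeat_genome[i:i + 8] for i in range(len(repeat_genome) - 7)]
repeat_edges = collapse_edges(*kmer_graph(repeat_reads, k=5))
print(len(repeat_edges))
```

In the first case the graph has no branch points, so all k-mers merge into one edge, which is the toy analogue of the single-edge graphs reported for C. jejuni and E. coli. In the second case the repeat creates a branch, and the ambiguity is kept in the graph rather than resolved arbitrarily.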