McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
Genome Res. 2012 Mar;22(3):557-67. doi: 10.1101/gr.131383.111. Epub 2012 Jan 6.
New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.
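The abstract refers to "statistics on contiguity" without naming one. A common example is N50: the contig length at which contigs of that length or longer account for at least half of the total assembled bases. The sketch below assumes N50 is the statistic of interest. The function name `n50` and the contig lengths are hypothetical illustrations, not data from the study. It also shows why contiguity alone says little about correctness: wrongly joining two contigs raises N50.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (illustrative sketch)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the total
        # assembly size defines N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths, not taken from the paper.
correct_assembly = [50, 50, 40, 40]
# Same bases, but two 50-unit contigs were (hypothetically) misjoined.
misjoined_assembly = [100, 40, 40]

print(n50(correct_assembly))    # 50
print(n50(misjoined_assembly))  # 100
```

Both toy assemblies contain the same 180 bases, yet the misjoined one reports twice the N50, so a higher value here reflects an error rather than a better assembly.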