Dohm Juliane C, Lottaz Claudio, Borodina Tatiana, Himmelbauer Heinz
Max-Planck-Institute for Molecular Genetics, 14195 Berlin-Dahlem, Germany.
Genome Res. 2007 Nov;17(11):1697-706. doi: 10.1101/gr.6435207. Epub 2007 Oct 1.
The latest revolution in the DNA sequencing field has been brought about by the development of automated sequencers that are capable of generating gigabase-pair data sets quickly and at low cost. Because the generated reads are short, applications of such technologies seem to be limited to resequencing and transcript discovery. To extend the fields of application to de novo sequencing, we developed the SHARCGS algorithm to assemble short-read (25- to 40-mer) data with high accuracy and speed. The efficiency of SHARCGS was tested on BAC inserts from three eukaryotic species, on two yeast chromosomes, and on two bacterial genomes (Haemophilus influenzae, Escherichia coli). We show that 30-mer-based BAC assemblies have N50 sizes >20 kbp for Drosophila and Arabidopsis and >4 kbp for human in simulations taking missing reads and wrong base calls into account. We assembled 949,974 contigs with length >50 bp, and only a single contig could not be aligned error-free against the reference sequences. We generated 36-mer reads for the genome of Helicobacter acinonychis on the Illumina 1G sequencing instrument and assembled 937 contigs covering 98% of the genome with an N50 size of 3.7 kbp. With the exception of five contigs that differ in 1-4 positions relative to the reference sequence, all contigs matched the genome error-free. Thus, SHARCGS is a suitable tool for fully exploiting novel sequencing technologies by assembling sequence contigs de novo with high confidence and by outperforming existing assembly algorithms in terms of speed and accuracy.
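The abstract reports N50 sizes without defining the statistic. Under the commonly used definition, N50 is the contig length L such that contigs of length L or longer account for at least half of the total assembled length. The sketch below illustrates that definition only; the function name `n50` and the example lengths are assumptions made for illustration and are not taken from the paper or from the SHARCGS software.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Uses the common definition: the length L such that contigs of
    length >= L together make up at least half of the total length.
    This is an illustrative helper, not part of SHARCGS.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")

    # Walk through contigs from longest to shortest.
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2

    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of
        # the total assembled length defines the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp (not data from the paper).
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

With these example lengths the total is 54 bp, and the three longest contigs (10 + 9 + 8 = 27 bp) reach half of it, so the function returns 8. A higher N50 therefore indicates that more of the assembled sequence sits in long contigs.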