Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.
Genome Res. 2012 Mar;22(3):549-56. doi: 10.1101/gr.126953.111. Epub 2011 Dec 7.
De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory-efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.
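To illustrate the kind of query an FM-index supports, the following is a minimal Python sketch of a Burrows-Wheeler transform and backward search that counts occurrences of a pattern in a text. It is not SGA's implementation: the function names `build_fm_index` and `count_occurrences` are hypothetical, the suffix array is built naively, and nothing is compressed, so it only suits toy inputs.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (C table, Occ table, BWT length) for `text`."""
    # Assumption: "$" does not occur in text and sorts before every other symbol.
    text += "$"
    # Naive suffix array: fine for a demo, far too slow for real read sets.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = the character preceding each suffix in sorted order.
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in text strictly smaller than c.
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(bwt)


def count_occurrences(pattern, index):
    """Count occurrences of `pattern` using backward search."""
    C, occ, n = index
    lo, hi = 0, n  # half-open interval of suffix-array rows
    # Extend the match one character at a time, right to left.
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("GATTACAGATTACA")
print(count_occurrences("ATTA", index))
```

Each step of the backward search narrows an interval of sorted suffixes using only the C and Occ tables, which is why the original text does not need to be kept in memory for counting queries.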