Cherukuri Yesesri, Janga Sarath Chandra
Department of Bio Health Informatics, School of Informatics and Computing, Indiana University Purdue University, 719 Indiana Ave Ste 319, Walker Plaza Building, Indianapolis, IN, 46202, USA.
Centre for Computational Biology and Bioinformatics, Indiana University School of Medicine, 5021 Health Information and Translational Sciences (HITS), 410 West 10th Street, Indianapolis, IN, 46202, USA.
BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):507. doi: 10.1186/s12864-016-2895-8.
Improved DNA sequencing methods have transformed the field of genomics over the last decade. This became possible through the development of inexpensive short-read sequencing technologies, which have so far resulted in three generations of sequencing platforms. More recently, a fourth generation of Nanopore-based single-molecule sequencing technology was developed around the MinION® sequencer, which is portable, inexpensive and fast. It is capable of generating reads longer than 100 kb. Despite these advantages, MinION reads have two major limitations: high error rates and the lack of established downstream pipelines. Algorithms for error correction have already emerged, while the development of pipelines is still at a nascent stage.
In this study, we benchmarked available assembly algorithms to find an appropriate framework that can efficiently assemble Nanopore-sequenced reads. To this end, we used genome-scale Nanopore sequencing datasets available for the E. coli and yeast genomes. To evaluate multiple algorithmic frameworks comprehensively, we included assemblers based on de Bruijn graphs (Velvet and ABySS), Overlap Layout Consensus (OLC) (Celera) and greedy extension (SSAKE). We analyzed the quality and accuracy of the assemblies as well as the computational performance of each assembler in the benchmark. Our analysis showed that the OLC-based assembler, Celera, generated a high-quality assembly, with N50 and mean contig length values ten times higher and a total contig count one-fifth that of the other tools. Celera also exhibited an average genome coverage of 12 % on the E. coli dataset and 70 % on the yeast dataset, with relatively short run times. In contrast, the de Bruijn graph-based assemblers Velvet and ABySS generated assemblies of moderate quality, and did so in less time when memory allocation was not limited, while the greedy extension-based assembler SSAKE generated an assembly of very poor quality, although with a genome coverage of 90 % on the yeast dataset.
OLC can be considered a favorable algorithmic framework for developing assemblers for Nanopore-based data. De Bruijn graph-based algorithms follow, as their run times are lower than or similar to those of OLC-based algorithms, irrespective of the memory allocated for the task. However, a few improvements must be made to the existing de Bruijn implementations before they can generate assemblies of reasonable quality. Our findings should help stimulate the development of novel assemblers for handling Nanopore sequence data.
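The comparison above rests on contig statistics such as N50 and mean contig length. As a minimal sketch of how these summary values are commonly computed from a list of contig lengths, the Python below uses the standard definition of N50 (the length of the contig at which the running total of contigs, sorted longest first, reaches half of the assembly size). The function names `n50` and `summarize` and the example lengths are illustrative assumptions, not part of the study's pipeline or data.

```python
from typing import Dict, Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """Return the N50 of an assembly.

    Contigs are sorted from longest to shortest and their lengths are
    accumulated; N50 is the length of the contig at which the running
    total first reaches at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def summarize(contig_lengths: Sequence[int]) -> Dict[str, float]:
    """Collect the contig statistics used to compare assemblies."""
    total = sum(contig_lengths)
    return {
        "contigs": len(contig_lengths),
        "total_bases": total,
        "mean_length": total / len(contig_lengths),
        "n50": n50(contig_lengths),
    }


if __name__ == "__main__":
    # Hypothetical contig lengths (in bases), for illustration only.
    example = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(summarize(example))
```

A higher N50 together with a lower contig count indicates a more contiguous assembly, which is the sense in which the Celera results are described as higher quality.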