Liu Binghang, Liu Chi-Man, Li Dinghua, Li Yingrui, Ting Hing-Fung, Yiu Siu-Ming, Luo Ruibang, Lam Tak-Wah
Bioinformatics Algorithms Research Laboratory, Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong.
BMC Genomics. 2016 Aug 31;17 Suppl 5(Suppl 5):499. doi: 10.1186/s12864-016-2829-5.
De novo genome assembly using NGS data remains a computation-intensive task, especially for large genomes. In practice, efficiency is often a primary concern, which favors a more efficient assembler such as SOAPdenovo2. Yet SOAPdenovo2, which is based on the de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers based on string graphs (e.g., SGA), though less popular and very slow, are better suited to longer reads.
This paper presents a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis on which BASE builds extension trees and then uses reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of the reads sharing the seeds. These consensus sequences are then extended into contigs.
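The idea of an adaptive seed can be illustrated with a minimal sketch. It is not BASE's implementation: the function names, the `min_len` and `max_occ` parameters, and the naive substring scan (used here in place of an efficient read index) are all assumptions made for illustration. The seed is grown one base at a time until it occurs in few enough reads to be plausibly unique in the genome.

```python
def count_occurrences(reads, pattern):
    """Count the reads containing `pattern`.

    Naive linear scan for illustration only; an assembler would query
    an index over the reads instead.
    """
    return sum(1 for read in reads if pattern in read)


def adaptive_seed(reads, read, start, min_len, max_occ):
    """Grow a seed from read[start:] until it occurs in at most `max_occ` reads.

    `min_len` and `max_occ` are hypothetical parameters: the shortest seed
    to try, and the occurrence count (roughly the expected coverage) below
    which a seed is treated as likely unique in the genome.
    Returns the seed, or None if the read ends before the seed is rare enough.
    """
    for end in range(start + min_len, len(read) + 1):
        seed = read[start:end]
        if count_occurrences(reads, seed) <= max_occ:
            return seed
    return None


reads = ["ACGTACGA", "CGTACGAT", "TTACGTAC", "ACGTTTTT"]
seed = adaptive_seed(reads, reads[0], start=0, min_len=3, max_occ=2)
print(seed)
```

In this toy input the shortest candidates are shared by too many reads, so the seed has to grow before it passes the threshold. That is the sense in which the seed length adapts to local repetitiveness rather than being fixed in advance.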
Experiments on two bacterial and four human datasets show the advantage of BASE in both contig quality and speed when dealing with longer reads. In the bacterial experiment, two datasets with read lengths of 100 bp and 250 bp were used. For the 250 bp dataset in particular, BASE gives much better quality than SOAPdenovo2 and SGA and is similar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and SOAPdenovo2 are further compared using human datasets with read lengths of 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, and the improvement becomes more significant when the read length reaches 250 bp. In addition, BASE is more memory-efficient than SOAPdenovo2 when the sequencing data contain errors.
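For readers unfamiliar with the N50 metric used in the comparison, the following sketch computes it from a list of contig lengths. N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases. The function name and the example lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty set of contigs")


# Illustrative lengths only (total 290, so half is 145).
print(n50([80, 70, 50, 40, 30, 20]))  # 80 + 70 = 150 >= 145, so N50 is 70
```

A higher N50 means that more of the assembly sits in long contigs, which is why it serves as a contiguity measure when assemblers are compared on the same dataset.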
BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.