Al-Okaily Anas A
Computer Science & Engineering Department, University of Connecticut, Storrs, 06269, CT, USA.
BMC Genomics. 2016 Mar 5;17:193. doi: 10.1186/s12864-016-2515-7.
Current high-throughput sequencing technologies generate large numbers of relatively short and error-prone reads, making the de novo assembly problem challenging. Although high-quality assemblies can be obtained by assembling multiple paired-end libraries with both short and long insert sizes, the latter are costly to generate. Recently, the GAGE-B study showed that remarkably good assembly quality can be obtained for bacterial genomes by state-of-the-art assemblers run on a single short-insert library with very high coverage.
In this paper, we introduce a novel hierarchical genome assembly (HGA) methodology that takes further advantage of such very high coverage by independently assembling disjoint subsets of reads, combining assemblies of the subsets, and finally re-assembling the combined contigs along with the original reads.
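The three stages named above (partition, combine, re-assemble) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `partition_reads` and `hierarchical_assembly` are hypothetical names, the random round-robin split is only one possible way to form disjoint subsets, and `assemble`, `merge`, and `reassemble` are placeholder callables standing in for whatever external assembler and merging tools are actually used. For paired-end data, each item in `reads` would be a read pair so that mates stay in the same subset.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Read = TypeVar("Read")


def partition_reads(reads: Sequence[Read], k: int, seed: int = 0) -> List[List[Read]]:
    """Split reads into k disjoint subsets that together cover every read.

    Assumption: a seeded shuffle followed by round-robin assignment; the
    source does not specify how the subsets are chosen.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    # Subset i takes every k-th read starting at position i.
    return [shuffled[i::k] for i in range(k)]


def hierarchical_assembly(
    reads: Sequence[Read],
    k: int,
    assemble: Callable,    # hypothetical: subset of reads -> contigs
    merge: Callable,       # hypothetical: list of contig sets -> combined contigs
    reassemble: Callable,  # hypothetical: (combined contigs, all reads) -> final assembly
):
    """Assemble disjoint subsets independently, combine, then re-assemble."""
    subsets = partition_reads(reads, k)
    sub_assemblies = [assemble(subset) for subset in subsets]
    combined_contigs = merge(sub_assemblies)
    # The final pass sees the combined contigs together with the original reads.
    return reassemble(combined_contigs, list(reads))
```

Because the subsets are disjoint, the per-subset assemblies are independent of one another and could be run in parallel; the sketch runs them sequentially for clarity.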
We empirically evaluated this methodology for 8 leading assemblers using 7 GAGE-B bacterial datasets consisting of 100 bp Illumina HiSeq and 250 bp Illumina MiSeq reads, with coverage ranging from 100x to ∼200x. The results show that, for all evaluated datasets and with most of the evaluated assemblers (those used to assemble the disjoint subsets), HGA leads to a significant improvement in assembly quality as measured by the N50 and corrected N50 metrics.
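For readers unfamiliar with the N50 metric, the sketch below shows the standard computation: sort contig lengths in decreasing order and report the length of the contig at which the running total first reaches half of the total. The function name `n50` and the optional `total` parameter (for normalizing against a known genome size instead of the assembly size) are illustrative assumptions, and the exact evaluation procedure used in the study is not reproduced here. As commonly defined in the GAGE evaluations, the corrected N50 applies the same computation after contigs have been split at detected misassemblies; that splitting step is not shown.

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], total: Optional[int] = None) -> int:
    """Return the N50 of a collection of contig lengths.

    If `total` is given (e.g. a reference genome size), half of that value is
    used as the target instead of half of the summed contig lengths.
    Returns 0 if the contigs never reach the target.
    """
    ordered = sorted(lengths, reverse=True)
    target = (sum(ordered) if total is None else total) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0
```

With contig lengths 80, 70, 50, 40, 30, and 20 (total 290), the running sum reaches half the total (145) at the second contig, so the N50 is 70.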