Koren Sergey, Walenz Brian P, Berlin Konstantin, Miller Jason R, Bergman Nicholas H, Phillippy Adam M
Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
Invincea Incorporated, Fairfax, Virginia 22030, USA.
Genome Res. 2017 May;27(5):722-736. doi: 10.1101/gr.215087.116. Epub 2017 Mar 15.
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
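The contig NG50 reported above is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the estimated genome size. The following is a minimal sketch of that standard definition, not code from Canu; the function name `ng50` and the example lengths are hypothetical and chosen only for illustration.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    contig_lengths: iterable of contig lengths (any consistent unit, e.g. bp).
    genome_size:    estimated genome size in the same unit.

    Returns None when the contigs together cover less than half of the
    genome, in which case NG50 is undefined.
    """
    half_genome = genome_size / 2
    covered = 0
    # Walk the contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half_genome:
            # This contig is the one that pushes coverage past 50%.
            return length
    return None


if __name__ == "__main__":
    # Hypothetical toy assembly: five contigs against a genome of size 200.
    print(ng50([50, 40, 30, 20, 10], 200))
```

Unlike N50, which is computed against the total assembly length, NG50 is computed against the estimated genome size, so it does not reward an assembly for simply being incomplete.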