Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

Authors

Koren Sergey, Walenz Brian P, Berlin Konstantin, Miller Jason R, Bergman Nicholas H, Phillippy Adam M

Affiliations

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.

Invincea Incorporated, Fairfax, Virginia 22030, USA.

Publication information

Genome Res. 2017 May;27(5):722-736. doi: 10.1101/gr.215087.116. Epub 2017 Mar 15.

DOI:10.1101/gr.215087.116

PMID:28298431

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5411767/

Abstract

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
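The contig NG50 statistic quoted in the abstract can be made concrete with a small sketch. NG50 is the length L such that contigs of length L or longer together cover at least half of the estimated genome size; N50 is the same calculation with the total assembly length in place of the genome size. The function name `ng50` and the toy contig lengths below are illustrative assumptions and are not taken from Canu or the paper.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    Contigs are visited from longest to shortest; the NG50 is the length
    of the contig at which the running total first reaches half of
    genome_size. Returns 0 if the assembly never covers half the genome.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


# Toy example (hypothetical lengths, arbitrary units).
contigs = [50, 40, 30, 20, 10]
print(ng50(contigs, genome_size=200))           # NG50 against a 200-unit genome
print(ng50(contigs, genome_size=sum(contigs)))  # N50: assembly size as denominator
```

Because NG50 is normalized by the genome size rather than by the assembly size, it allows assemblies of the same genome to be compared even when their total lengths differ.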

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1fd7/5411767/d0ba4edfbdc8/722f01.jpg
