

HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads.

Author affiliations

Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20894, USA.

Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA.

Publication information

Genome Res. 2020 Sep;30(9):1291-1305. doi: 10.1101/gr.263566.120. Epub 2020 Aug 14.

Abstract

Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.
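The abstract relies on three quantities that are easy to make concrete: homopolymer compression, the Phred-scaled consensus quality (QV), and NG50. The Python sketch below illustrates the standard definitions only; it is not HiCanu's implementation, and the function names (`homopolymer_compress`, `phred_qv`, `ng50`) are hypothetical. Returning 0 from `ng50` when the contigs cover less than half of the genome is a convention assumed here.

```python
import math
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases to a single base."""
    # groupby yields one (base, run) pair per maximal run of equal characters.
    return "".join(base for base, _run in groupby(seq))


def phred_qv(accuracy: float) -> float:
    """Phred-scaled quality: QV = -10 * log10(error rate)."""
    return -10.0 * math.log10(1.0 - accuracy)


def ng50(contig_lengths, genome_size):
    """Length of the contig at which the cumulative length of contigs,
    taken from longest to shortest, first reaches half of genome_size."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return 0  # assumed convention: contigs cover less than half the genome


if __name__ == "__main__":
    print(homopolymer_compress("AAACCGTTTT"))    # runs of A, C, G, T collapse
    print(round(phred_qv(0.99999)))              # 99.999% accuracy
    print(ng50([50, 40, 30, 20, 10], 200))       # toy contig set
```

Under these definitions, a per-base accuracy of 99.999% corresponds to an error rate of 10^-5 and therefore to QV50, which matches the figure quoted in the abstract. Unlike N50, NG50 is computed against the genome size rather than the total assembly length, so it remains comparable across assemblies of different total size.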


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00a3/7545148/8c2214347ce3/1291f01.jpg
