Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA.
UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA.
Nat Methods. 2022 Jun;19(6):687-695. doi: 10.1038/s41592-022-01440-3. Epub 2022 Mar 31.
Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although the initial draft assembly was derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9, as measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
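For readers unfamiliar with assembly quality values, the sketch below shows how a Phred-scaled QV relates to a per-base error rate, and how a QV can be estimated from k-mer counts. The k-mer formula is the form commonly used by k-mer-based evaluation tools; that this exact formula produced the 70.2 and 73.9 figures is an assumption, since the abstract does not state it. The function names and the counts in the usage example are hypothetical and chosen only for illustration.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


def kmer_qv(asm_only_kmers: int, total_kmers: int, k: int) -> float:
    """Estimate assembly QV from k-mer counts (assumed formula, see lead-in).

    asm_only_kmers: k-mers found in the assembly but absent from the reads
    total_kmers:    all k-mers in the assembly
    k:              k-mer length
    """
    if not 0 < asm_only_kmers < total_kmers:
        raise ValueError("need 0 < asm_only_kmers < total_kmers")
    # Probability that a k-mer is error-free is (1 - asm_only / total).
    # A k-mer is correct only if all k bases are correct, so the per-base
    # error is 1 - p_kmer ** (1 / k). log1p/expm1 keep precision when the
    # error rate is tiny, as it is for assemblies near QV 70.
    log_p_base = math.log1p(-asm_only_kmers / total_kmers) / k
    base_error = -math.expm1(log_p_base)
    return error_rate_to_qv(base_error)


if __name__ == "__main__":
    # QVs reported in the abstract, before and after polishing.
    for qv in (70.2, 73.9):
        print(f"QV {qv}: about {qv_to_error_rate(qv):.2e} errors per base")
    # Hypothetical counts, not taken from the paper.
    print(f"example k-mer QV: {kmer_qv(1, 1_000_000, 21):.2f}")
```

Under this conversion, QV 70.2 corresponds to roughly 9.5e-8 errors per base and QV 73.9 to roughly 4.1e-8, that is, on the order of one error per 10 to 25 million bases.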