Antipov Dmitry, Rautiainen Mikko, Nurk Sergey, Walenz Brian P, Solar Steven J, Phillippy Adam M, Koren Sergey
Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
Institute for Molecular Medicine Finland, Helsinki Institute of Life Science, University of Helsinki, Tukholmankatu 8, Biomedicum 2, Helsinki, Finland.
Genome Res. 2025 Jun 12. doi: 10.1101/gr.280383.124.
The Telomere-to-Telomere Consortium recently finished the first truly complete sequence of a human genome. To resolve the most complex repeats, this project relied on a semimanual combination of long, accurate Pacific Biosciences (PacBio) HiFi reads and ultralong Oxford Nanopore Technologies reads. The Verkko assembler later automated this process, achieving complete assemblies for approximately half of the chromosomes in a diploid human genome. However, the first version of Verkko was computationally expensive and could not resolve all regions of a typical human genome. Here we present Verkko2, which implements a more efficient read correction algorithm, improves repeat resolution and gap closing, introduces proximity-ligation-based haplotype phasing and scaffolding, and adds support for multiple long-read data types. These enhancements allow Verkko2 to assemble all regions of a diploid human genome, including the short arms of the acrocentric chromosomes and both sex chromosomes. Relative to the first version, these changes double the number of telomere-to-telomere scaffolds, reduce runtime fourfold, and improve assembly correctness. On a panel of 19 human genomes, Verkko2 assembles an average of 39 of 46 chromosomes as complete scaffolds, 21 of which are gapless contigs. Together, these improvements enable telomere-to-telomere comparative genomics and pangenomics at scale.