Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India.
Genome Res. 2024 Nov 20;34(11):1908-1918. doi: 10.1101/gr.279311.124.
Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. The string graph is a commonly used assembly graph representation in assembly algorithms. The string graph formulation employs graph simplification heuristics, which drastically reduce the count of vertices and edges. One of these heuristics involves removing the reads contained in longer reads. In practice, this heuristic occasionally introduces gaps in the assembly by removing all reads that cover one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near a germline and a somatic heterozygous variant locus. Our analysis shows that (1) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore Technologies (ONT) reads than in Pacific Biosciences high-fidelity (PacBio HiFi) reads due to differences in their read-length distributions, and (2) this frequency decreases with an increase in the sequencing depth. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the repeat-aware fragmenting tool (RAFT) assembly algorithm. RAFT addresses the issue of contained reads by fragmenting reads and producing a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. Using simulated data sets, we empirically demonstrate that RAFT significantly reduces the number of gaps. Using real ONT and PacBio HiFi data sets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to hifiasm.
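The gap mechanism described above can be illustrated with a minimal toy sketch. It is not the authors' model or the RAFT implementation. It assumes reads are error-free intervals on one of two haplotypes, that two reads from different haplotypes are indistinguishable unless they span a heterozygous site, and that a gap means a haplotype interval covered by no retained read from that haplotype. All names (`Read`, `is_contained`, `remove_contained`, `fragment`, `uncovered`) and all coordinates are hypothetical. The `fragment` helper is a naive fixed-length split used only to show the effect of a more uniform read-length distribution; it is not repeat-aware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    hap: int    # haplotype of origin (1 or 2)
    start: int  # inclusive start coordinate
    end: int    # exclusive end coordinate

    def __len__(self):
        return self.end - self.start


def is_contained(r, other, het_sites):
    """True if r looks like a substring of the strictly longer read `other`."""
    if len(other) <= len(r):
        return False
    if not (other.start <= r.start and r.end <= other.end):
        return False
    if r.hap == other.hap:
        return True
    # Across haplotypes, r matches `other` only if r spans no heterozygous site.
    return not any(r.start <= s < r.end for s in het_sites)


def remove_contained(reads, het_sites):
    """Contained-read heuristic: drop every read contained in a longer read."""
    return [r for r in reads
            if not any(is_contained(r, o, het_sites) for o in reads)]


def fragment(read, max_len, overlap):
    """Naive split into overlapping pieces of at most max_len (assumption)."""
    if len(read) <= max_len:
        return [read]
    pieces, start = [], read.start
    while True:
        end = min(start + max_len, read.end)
        pieces.append(Read(read.hap, start, end))
        if end == read.end:
            return pieces
        start += max_len - overlap


def uncovered(reads, hap, length):
    """Intervals of haplotype `hap` covered by no read from that haplotype."""
    covered = [False] * length
    for r in reads:
        if r.hap == hap:
            for p in range(r.start, r.end):
                covered[p] = True
    gaps, p = [], 0
    while p < length:
        if covered[p]:
            p += 1
            continue
        q = p
        while q < length and not covered[q]:
            q += 1
        gaps.append((p, q))
        p = q
    return gaps


het_sites = {5, 95}
reads = [
    Read(1, 0, 100),  # long haplotype-1 read spanning both heterozygous sites
    Read(2, 0, 30),   # haplotype-2 read spanning site 5
    Read(2, 20, 80),  # haplotype-2 read between the sites
    Read(2, 70, 100), # haplotype-2 read spanning site 95
]

# Without fragmentation, Read(2, 20, 80) is removed as contained.
kept = remove_contained(reads, het_sites)
gaps_before = uncovered(kept, hap=2, length=100)   # [(30, 70)]

# With naive fragmentation, no read is much longer than the others.
pieces = [p for r in reads for p in fragment(r, max_len=40, overlap=10)]
kept_frag = remove_contained(pieces, het_sites)
gaps_after = uncovered(kept_frag, hap=2, length=100)  # []
```

In this toy case the middle haplotype-2 read is discarded because it is indistinguishable from a substring of the long haplotype-1 read, leaving positions 30 to 70 of haplotype 2 without support. After splitting reads to a common maximum length, no read is contained in another and the gap disappears.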