Systems and Computing Engineering Department, Universidad de los Andes, Bogotá, Colombia.
Methods Mol Biol. 2023;2590:273-286. doi: 10.1007/978-1-0716-2819-5_16.
The ultimate goal of de novo assembly of reads sequenced from a diploid individual is the separate reconstruction of the sequences corresponding to the two copies of each chromosome. Unfortunately, the allele linkage information needed to perform phased genome assemblies has been difficult to generate. Hence, most current genome assemblies are a haploid mixture of the two underlying chromosome copies present in the sequenced individual. Sequencing technologies providing long (20 kb) and accurate reads are the basis for generating phased genome assemblies. This chapter provides a brief overview of the main milestones in traditional genome assembly, focusing on the bioinformatic techniques developed to generate haplotype information from different specialized protocols. Using these techniques as background, the chapter reviews the current algorithms for generating phased assemblies from long reads with low error rates. Current techniques perform haplotype-aware error correction steps to increase the quality of the raw reads. In addition, variations on the traditional overlap-layout-consensus (OLC) graph have been developed in an effort to eliminate edges between reads sequenced from different chromosome copies. These graph variations also allow large presence-absence variants between the chromosome copies to be taken into account. The development of these algorithms, along with improved sequencing technologies, has been crucial to finishing chromosome-level assemblies of complex genomes.
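The idea of eliminating overlap-graph edges between reads sequenced from different chromosome copies can be illustrated with a minimal sketch. It is not the method of any particular assembler: it assumes that each read has already been reduced to the bases it shows at known heterozygous sites, and the names `count_conflicts`, `filter_overlaps`, and `max_conflicts`, as well as the toy reads, are hypothetical. An overlap is kept only if the two reads disagree at no more than `max_conflicts` shared heterozygous sites.

```python
from typing import Dict, List, Tuple

# Assumed representation: heterozygous site position -> base observed in the read.
Alleles = Dict[int, str]


def count_conflicts(a: Alleles, b: Alleles) -> int:
    """Count heterozygous sites covered by both reads where the bases differ."""
    shared = a.keys() & b.keys()
    return sum(1 for pos in shared if a[pos] != b[pos])


def filter_overlaps(
    overlaps: List[Tuple[str, str]],
    alleles: Dict[str, Alleles],
    max_conflicts: int = 0,
) -> List[Tuple[str, str]]:
    """Keep only overlaps whose reads are consistent with a single haplotype."""
    kept = []
    for r1, r2 in overlaps:
        if count_conflicts(alleles[r1], alleles[r2]) <= max_conflicts:
            kept.append((r1, r2))
    return kept


# Toy data (hypothetical): r1 and r2 carry one haplotype, r3 and r4 the other.
reads = {
    "r1": {100: "A", 250: "G"},
    "r2": {100: "A", 250: "G", 400: "T"},
    "r3": {100: "C", 250: "T"},
    "r4": {250: "T", 400: "C"},
}
overlaps = [("r1", "r2"), ("r1", "r3"), ("r2", "r4"), ("r3", "r4")]

phased_edges = filter_overlaps(overlaps, reads)
print(phased_edges)  # [('r1', 'r2'), ('r3', 'r4')]
```

A nonzero `max_conflicts` would tolerate residual sequencing errors at the cost of occasionally joining reads from different haplotypes; the right threshold depends on read accuracy and is left open here.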