Service Evolution Biologique et Ecologie, Université libre de Bruxelles (ULB), Avenue Franklin D. Roosevelt 50, 1050, Brussels, Belgium.
Laboratoire d'Ecologie et Génétique Evolutive, Université de Namur, Rue de Bruxelles 61, 5000, Namur, Belgium.
BMC Bioinformatics. 2021 Jun 5;22(1):303. doi: 10.1186/s12859-021-04118-3.
Long-read sequencing is revolutionizing genome assembly: as PacBio and Nanopore technologies become more accessible, both technically and financially, long-read assemblers are flourishing and starting to deliver chromosome-level assemblies. However, these long reads are usually error-prone, which makes generating a haploid reference from a diploid genome difficult. Failure to properly collapse haplotypes results in fragmented and structurally incorrect assemblies and wreaks havoc on orthology inference pipelines. Yet this serious issue is rarely acknowledged or addressed in genome projects, and an independent, comparative benchmark of how well assemblers and post-processing tools collapse or purge haplotypes is still lacking.
We tested different assembly strategies on the genome of the rotifer Adineta vaga, a non-model organism for which high coverage of both PacBio and Nanopore reads was available. The assemblers we tested (Canu, Flye, NextDenovo, Ra, Raven, Shasta and wtdbg2) behaved strikingly differently when dealing with highly heterozygous regions, resulting in variable amounts of uncollapsed haplotypes. Filtering reads generally improved haploid assemblies. We also benchmarked three post-processing tools aimed at detecting and purging uncollapsed haplotypes in long-read assemblies: HaploMerger2, purge_haplotigs and purge_dups.
We provide a thorough evaluation of popular assemblers on a non-model eukaryote genome with variable levels of heterozygosity. Our study highlights several strategies that use pre- and post-processing approaches to generate haploid assemblies with high continuity and completeness. This benchmark will help users improve haploid assemblies of non-model organisms and evaluate the quality of their own assemblies.