Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China.
College of Informatics, Huazhong Agricultural University, Wuhan 430070, China.
Genome Res. 2024 Mar 20;34(2):326-340. doi: 10.1101/gr.278232.123.
Pacific Biosciences (PacBio) HiFi sequencing technology generates long reads (>10 kbp) with very high accuracy (<0.01% sequencing error). Although several de novo assembly tools are available for HiFi reads, no comprehensive evaluation of these assemblers has been published. We evaluated the performance of 11 de novo HiFi assemblers on (1) real data for three eukaryotic genomes; (2) 34 synthetic data sets with different ploidy, sequencing coverage levels, heterozygosity rates, and sequencing error rates; (3) one real metagenomic data set; and (4) five synthetic metagenomic data sets with different composition abundance and heterozygosity rates. The 11 assemblers were evaluated using the Quality Assessment Tool (QUAST) and Benchmarking Universal Single-Copy Orthologs (BUSCO). We also used several additional criteria, namely, completion rate, single-copy completion rate, duplicated completion rate, average proportion of largest category, average distance difference, quality value, run-time, and memory utilization. The results show that hifiasm and hifiasm-meta should be the first choice for assembling eukaryotic genomes and metagenomes, respectively, with HiFi data. This comprehensive benchmark of commonly used assemblers on complex eukaryotic genomes and metagenomes will help the research community choose the most appropriate assembler for their data and identify possible improvements in assembly algorithms.
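To make the accuracy figure and the "quality value" criterion concrete, the sketch below converts between an error rate and a Phred-scaled quality value, QV = -10 log10(error rate). This is the conventional definition and is assumed here; the abstract does not state how the study computed its quality values, and the function names are hypothetical.

```python
import math


def phred_qv(error_rate: float) -> float:
    """Phred-scaled quality value: QV = -10 * log10(error_rate).

    Assumption: this is the conventional Phred definition, not a formula
    taken from the study itself.
    """
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10.0 * math.log10(error_rate)


def error_rate_from_qv(qv: float) -> float:
    """Inverse of phred_qv: error_rate = 10 ** (-QV / 10)."""
    return 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # The abstract's bound of 0.01% sequencing error, expressed as a fraction.
    hifi_error_bound = 0.01 / 100
    print(f"QV at 0.01% error: {phred_qv(hifi_error_bound):.1f}")
    print(f"Error rate at QV 30: {error_rate_from_qv(30):.4%}")
```

Under this definition, an error rate below 0.01% corresponds to a quality value above 40, and each 10-point increase in QV means a tenfold reduction in the error rate.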