Department of Microbiology, Faculty of Medicine, Chiang Mai University, Chiang Mai, 50200, Thailand.
F1000Res. 2024 May 31;13:556. doi: 10.12688/f1000research.149577.1. eCollection 2024.
Determining the appropriate computational requirements and software performance is essential for efficient genomic surveillance. The lack of standardized benchmarking complicates software selection, especially with limited resources.
We developed a containerized benchmarking pipeline to evaluate seven long-read assemblers (Canu, GoldRush, MetaFlye, Strainline, HaploDMF, iGDA, and RVHaplo) for viral haplotype reconstruction, using both simulated and experimental Oxford Nanopore sequencing data from HIV-1 and other viruses. Benchmarking was conducted on three computational systems to assess each assembler's performance, with QUAST and BLASTN used for quality assessment.
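As a rough illustration of what measuring one assembler run on one system could involve, the sketch below times a command and reads back the peak memory of child processes. It is not the pipeline's own code: the function name `benchmark_command` is hypothetical, the stand-in command is a trivial Python call rather than a real assembler invocation, and the `resource` module it relies on is available only on Unix-like systems.

```python
import resource
import subprocess
import sys
import time


def benchmark_command(cmd):
    """Run cmd once; return (exit_code, wall_seconds, peak_rss_of_children)."""
    start = time.perf_counter()
    completed = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    wall_seconds = time.perf_counter() - start
    # ru_maxrss is the largest resident set size among child processes that
    # have been waited for so far (kilobytes on Linux, bytes on macOS).
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return completed.returncode, wall_seconds, peak_rss


if __name__ == "__main__":
    # Stand-in command (an assumption): replace with an actual assembler call.
    print(benchmark_command([sys.executable, "-c", "print('hello')"]))
```

Because `RUSAGE_CHILDREN` reports a maximum over all children waited for so far, a fair comparison would run each assembler from a fresh parent process.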
Our findings show that assembler choice significantly affects assembly time, whereas CPU and memory usage have minimal effect. Assembler selection also influences contig size, and a minimum read length of 2,000 nucleotides is required for a quality assembly; a read length of 4,000 nucleotides improves quality further. Canu was efficient but not suitable for multi-strain mixtures, while GoldRush produced only consensus assemblies. Strainline and MetaFlye were suitable for metagenomic sequencing data, with Strainline requiring high memory and MetaFlye able to run on low-specification machines. Among the reference-based assemblers, iGDA had high error rates; RVHaplo showed the best runtime and accuracy but became ineffective when the input sequences were similar; and HaploDMF, which uses machine learning, made fewer errors at the cost of a slightly longer runtime.
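The read-length thresholds above suggest a simple pre-filtering step before assembly. The following is a minimal sketch, assuming uncompressed FASTQ input with exactly four lines per record; the function name `filter_fastq_by_length` and the example reads are hypothetical and are not taken from the published pipeline.

```python
import io


def filter_fastq_by_length(handle, min_length=2000):
    """Yield (header, sequence, quality) for reads with >= min_length bases.

    Assumes four-line FASTQ records (no wrapped sequence lines).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        if len(sequence) >= min_length:
            yield header, sequence, quality


if __name__ == "__main__":
    # Three made-up reads of 1,500, 2,000, and 4,500 nucleotides.
    records = []
    for name, length in [("short", 1500), ("edge", 2000), ("long", 4500)]:
        records.append(f"@{name}\n{'A' * length}\n+\n{'I' * length}\n")
    fastq = io.StringIO("".join(records))
    for header, sequence, _ in filter_fastq_by_length(fastq, min_length=2000):
        print(header, len(sequence))
```

Raising `min_length` to 4000 would apply the stricter threshold mentioned above, at the cost of discarding more reads.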
The HIV-64148 pipeline, containerized with Docker, is easy to deploy and lets users choose from a range of assemblers to match their computational systems or study requirements. The tool aids genome assembly and provides valuable information on HIV-1 sequences, supporting the monitoring and understanding of viral evolution.