Comparative Evaluation of Open-Source Bioinformatics Pipelines for Full-Length Viral Genome Assembly

Authors

Zsichla Levente, Zeeb Marius, Fazekas Dávid, Áy Éva, Müller Dalma, Metzner Karin J, Kouyos Roger D, Müller Viktor

Affiliations

Institute of Biology, ELTE Eötvös Loránd University, 1117 Budapest, Hungary.

National Laboratory for Health Security, ELTE Eötvös Loránd University, 1117 Budapest, Hungary.

Publication

Viruses. 2024 Nov 24;16(12):1824. doi: 10.3390/v16121824.

DOI:10.3390/v16121824

PMID:39772134

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11680378/

Abstract

The increasingly widespread application of next-generation sequencing (NGS) in clinical diagnostics and epidemiological research has generated a demand for robust, fast, automated, and user-friendly bioinformatics workflows. To guide the choice of tools for the assembly of full-length viral genomes from NGS datasets, we assessed the performance and applicability of four open-source bioinformatics pipelines (shiver, for which we created a user-friendly Dockerized version referred to as dshiver; SmaltAlign; viral-ngs; and V-pipe) using both simulated and real-world HIV-1 paired-end short-read datasets and default settings. All four pipelines produced consensus genome assemblies with high quality metrics (genome fraction recovery, mismatch and indel rates, variant calling F1 scores) when the reference sequence used for assembly had high similarity to the analyzed sample. The shiver and SmaltAlign pipelines (but not viral-ngs and V-pipe) also showed robust performance with more divergent samples (non-matching subtypes). With empirical datasets, SmaltAlign and viral-ngs exhibited runtimes an order of magnitude shorter than those of V-pipe and shiver. In terms of applicability, V-pipe provides the broadest functionality, SmaltAlign and dshiver combine user-friendliness with robustness, and viral-ngs requires fewer computational resources than the other pipelines. In conclusion, if a closely matched reference sequence is available, all pipelines can reliably reconstruct viral consensus genomes; therefore, differences in user-friendliness and runtime may guide the choice of pipeline in a particular setting. If a matched reference sequence cannot be selected, we recommend shiver or SmaltAlign for robust performance. The new Dockerized version of shiver offers ease of use in addition to the accuracy and robustness of the original pipeline.
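The quality metrics named in the abstract can be illustrated with a small sketch. This is not the evaluation code used in the study; it is a minimal illustration that assumes a precomputed pairwise alignment between a known true genome and a pipeline's consensus (two equal-length strings with `-` for gaps), and variants represented as hypothetical `(position, ref, alt)` tuples. The function names `consensus_metrics` and `variant_f1`, the per-column indel counting, and the choice of denominators are all assumptions made for illustration.

```python
def consensus_metrics(truth_aln: str, asm_aln: str) -> dict:
    """Compare two rows of a pairwise alignment ('-' marks a gap).

    Assumed definitions (illustrative only):
      genome_fraction = truth bases aligned to a called (non-N) base / truth length
      mismatch_rate   = mismatching columns / columns where both rows have a called base
      indel_rate      = gap columns (counted per column, not per event) / truth length
    """
    if len(truth_aln) != len(asm_aln):
        raise ValueError("alignment rows must have equal length")

    truth_len = sum(1 for t in truth_aln if t != "-")
    covered = mismatches = indel_columns = 0

    for t, a in zip(truth_aln.upper(), asm_aln.upper()):
        if t == "-" and a == "-":
            continue  # empty column, carries no information
        if t == "-" or a == "-":
            indel_columns += 1  # insertion or deletion in the assembly
            continue
        if a == "N":
            continue  # uncalled base: not covered, not a mismatch
        covered += 1
        if t != a:
            mismatches += 1

    return {
        "genome_fraction": covered / truth_len if truth_len else 0.0,
        "mismatch_rate": mismatches / covered if covered else 0.0,
        "indel_rate": indel_columns / truth_len if truth_len else 0.0,
    }


def variant_f1(truth: set, called: set) -> float:
    """F1 score of called variants against a truth set of (pos, ref, alt) tuples."""
    true_positives = len(truth & called)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(called)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy alignment: one mismatch, one deletion, one insertion.
    print(consensus_metrics("ACGTACGT-A", "ACGAAC-TTA"))
    # Toy variant sets: two shared calls, one missed, one spurious.
    truth_vars = {(10, "A", "G"), (20, "C", "T"), (30, "G", "A")}
    called_vars = {(10, "A", "G"), (20, "C", "T"), (40, "T", "C")}
    print(variant_f1(truth_vars, called_vars))
```

In practice such metrics are usually obtained from dedicated assembly and variant-comparison tools rather than hand-written code; the sketch only makes the definitions concrete.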
