Cadenas-Castrejón Elizabeth, Verleyen Jérôme, Boukadida Celia, Díaz-González Lorena, Taboada Blanca
Brief Funct Genomics. 2023 Jan 20;22(1):31-41. doi: 10.1093/bfgp/elac036.
Viruses are the most abundant infectious agents on Earth, and they infect living organisms such as bacteria, plants and animals, among others. They play an important role in the balance of different ecosystems by modulating microbial populations. In humans, they are responsible for some common diseases and may cause severe illnesses. Viral metagenomic studies have become essential because they make it possible to understand and extend the knowledge of virus diversity and functionality. An essential step in these approaches is the classification of viral sequences. In this work, 11 taxonomic classification tools were compared on the same viral and nonviral datasets, using the same evaluation metrics. Their performance was analysed in terms of sensitivity and precision when classifying reads at the species and family levels, together with their processing times and memory requirements. The results showed that richness (the number of viral species in a sample), the taxonomic level of the classification and read length all influence tool performance. High viral richness decreased the performance of most tools. Additionally, classification was better at higher taxonomic levels, such as family, than at lower taxonomic levels, such as species, and this difference was more evident for short reads. The results also indicated that BLAST and Kraken2 were the best tools for classifying all types of reads, while FastViromeExplorer and VirusFinder performed well only on long reads, and Centrifuge, DIAMOND and One Codex performed well only on short reads. Regarding the nonviral datasets (human and bacterial), all tools correctly classified them as nonviral.
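To make the two evaluation metrics concrete, the following is a minimal sketch of how read-level sensitivity and precision could be computed at two taxonomic levels. The abstract does not state the exact definitions used, so the ones here are assumptions: sensitivity is taken as correctly classified reads over all viral reads, and precision as correctly classified reads over all reads that received any label. The function name, the taxon labels (`sp1`, `famA`, and so on) and the species-to-family mapping are hypothetical and serve only as illustration.

```python
from typing import Dict, Optional, Sequence, Tuple


def sensitivity_precision(
    truth: Sequence[str],
    predicted: Sequence[Optional[str]],
) -> Tuple[float, float]:
    """Return (sensitivity, precision) for one taxonomic level.

    Assumed definitions (not taken from the paper):
      sensitivity = correct reads / all reads with a known true taxon
      precision   = correct reads / reads that were given any label
    A prediction of None means the tool left the read unclassified.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    correct = sum(1 for t, p in zip(truth, predicted) if p is not None and p == t)
    classified = sum(1 for p in predicted if p is not None)

    sensitivity = correct / len(truth) if truth else 0.0
    precision = correct / classified if classified else 0.0
    return sensitivity, precision


# Hypothetical toy data: four reads with species-level truth and predictions.
SPECIES_TO_FAMILY: Dict[str, str] = {
    "sp1": "famA", "sp2": "famA", "sp3": "famB", "sp4": "famB",
}
true_species = ["sp1", "sp2", "sp3", "sp4"]
pred_species = ["sp1", "sp1", None, "sp4"]  # second read: wrong species, right family

# Collapse species labels to family labels, keeping unclassified reads as None.
true_family = [SPECIES_TO_FAMILY[s] for s in true_species]
pred_family = [SPECIES_TO_FAMILY[s] if s is not None else None for s in pred_species]

species_scores = sensitivity_precision(true_species, pred_species)
family_scores = sensitivity_precision(true_family, pred_family)
print("species:", species_scores)  # sensitivity 0.5, precision 2/3
print("family: ", family_scores)   # sensitivity 0.75, precision 1.0
```

In this toy case the read assigned to the wrong species within the right family counts as an error at the species level but as a hit at the family level, which is one way a tool can score better at higher taxonomic levels.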