Ha Anh D, Aylward Frank O
Department of Biological Sciences, Virginia Tech, Blacksburg VA, 24061.
Center for Emerging, Zoonotic, and Arthropod-Borne Infectious Disease, Virginia Tech, Blacksburg VA, 24061.
bioRxiv. 2023 Nov 13:2023.11.10.566645. doi: 10.1101/2023.11.10.566645.
Viruses of the phylum Nucleocytoviricota, often referred to as "giant viruses," are prevalent in various environments around the globe and play significant roles in shaping eukaryotic diversity and activities in global ecosystems. Given the extensive phylogenetic diversity within this viral group and the highly complex composition of their genomes, taxonomic classification of giant viruses, particularly of incomplete metagenome-assembled genomes (MAGs), can present a considerable challenge. Here we developed TIGTOG (Taxonomic Information of Giant viruses using Trademark Orthologous Groups), a machine learning-based approach to predict the taxonomic classification of novel giant virus MAGs based on profiles of protein family content. We applied a random forest algorithm to a training set of 1,531 quality-checked, phylogenetically diverse genomes using pre-selected sets of giant virus orthologous groups (GVOGs). The classification models were predictive of viral taxonomic assignments, with a cross-validation accuracy of 99.6% at the order level and 97.3% at the family level. We found that no individual GVOG or genome feature significantly influenced the algorithm's performance or the models' predictions, indicating that classification predictions were based on a comprehensive genomic signature, which reduces the need for a fixed set of marker genes for taxonomic assignment. Our classification models were validated with an independent test set of 823 giant virus genomes of varied genomic completeness and taxonomy, and demonstrated an accuracy of 98.6% and 95.9% at the order and family level, respectively. Our results indicate that protein family profiles can be used to accurately classify large DNA viruses at different taxonomic levels and provide a fast and accurate method for the classification of giant viruses. This approach could easily be adapted to other viral groups.
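The general idea described above, classifying genomes from presence/absence profiles of protein families with a random forest, can be illustrated with a small sketch. This is not the TIGTOG implementation: the data are synthetic, the class names (`order_A` and so on), the function names, and every parameter value are hypothetical, and the forest is a deliberately simplified pure-Python version (bootstrap samples, random feature subsets, Gini splits, majority vote). A real analysis would more likely rely on an established machine learning library.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, n_try, rng, depth=0, max_depth=8):
    """Grow one tree on 0/1 profiles; a leaf is a label, a node is a tuple."""
    if len(set(labels)) == 1 or depth >= max_depth:
        return majority(labels)
    best_score, best_f = None, None
    # Only a random subset of protein families is considered at each split.
    for f in rng.sample(range(len(rows[0])), n_try):
        absent = [y for x, y in zip(rows, labels) if not x[f]]
        present = [y for x, y in zip(rows, labels) if x[f]]
        if not absent or not present:
            continue
        score = (len(absent) * gini(absent) + len(present) * gini(present)) / len(labels)
        if best_score is None or score < best_score:
            best_score, best_f = score, f
    if best_f is None:
        return majority(labels)
    sides = {0: ([], []), 1: ([], [])}
    for x, y in zip(rows, labels):
        sides[x[best_f]][0].append(x)
        sides[x[best_f]][1].append(y)
    return (best_f,
            build_tree(*sides[0], n_try, rng, depth + 1, max_depth),
            build_tree(*sides[1], n_try, rng, depth + 1, max_depth))

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        f, if_absent, if_present = tree
        tree = if_present if x[f] else if_absent
    return tree

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_try, rng))
    return forest

def predict(forest, x):
    """Majority vote over all trees."""
    return majority([predict_tree(t, x) for t in forest])

def simulate_profiles(n_per_class, n_families=40, seed=1):
    """Synthetic genomes: each hypothetical order has 12 typical families;
    genes are lost at random to mimic incomplete MAGs, plus 5% noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for name in ("order_A", "order_B", "order_C"):
        typical = set(rng.sample(range(n_families), 12))
        for _ in range(n_per_class):
            completeness = rng.uniform(0.6, 1.0)
            rows.append(tuple(
                int((g in typical and rng.random() < completeness)
                    or rng.random() < 0.05)
                for g in range(n_families)))
            labels.append(name)
    return rows, labels

def holdout_accuracy(rows, labels, test_fraction=0.3, seed=2):
    """Train on a random subset, report accuracy on the held-out genomes."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    n_test = int(len(order) * test_fraction)
    test, train = order[:n_test], order[n_test:]
    forest = fit_forest([rows[i] for i in train], [labels[i] for i in train])
    hits = sum(predict(forest, rows[i]) == labels[i] for i in test)
    return hits / n_test

if __name__ == "__main__":
    rows, labels = simulate_profiles(40)
    print(f"hold-out accuracy on synthetic profiles: {holdout_accuracy(rows, labels):.2f}")
```

Because each tree sees a different bootstrap sample and a different random subset of families at every split, the vote does not hinge on any single family being present, which is the property that makes this kind of model attractive for incomplete genomes.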