Alipour Fatemeh, Holmes Connor, Lu Yang Young, Hill Kathleen A, Kari Lila
School of Computer Science, University of Waterloo, Waterloo, ON, Canada.
Department of Biology, University of Western Ontario, London, ON, Canada.
Front Mol Biosci. 2024 Jan 11;10:1305506. doi: 10.3389/fmolb.2023.1305506. eCollection 2023.
Astroviruses are a family of genetically diverse viruses associated with disease in humans and birds, with significant health effects and economic burdens. Astrovirus taxonomic classification includes two genera, Avastrovirus and Mamastrovirus. However, with next-generation sequencing, broader interspecies transmission has been observed, necessitating a reexamination of the current host-based taxonomic classification approach. In this study, a novel taxonomic classification method is presented for emergent and as yet unclassified astroviruses, based on whole genome sequence k-mer composition in addition to host information. An optional component that identifies recombinant sequences was added to the method's pipeline to counteract the impact of genetic recombination on viral classification. The proposed three-pronged classification method consists of a supervised machine learning method, an unsupervised machine learning method, and the consideration of host species. Using this three-pronged approach, we propose genus labels for 191 as yet unclassified astrovirus genomes. Genus labels are also suggested for an additional eight as yet unclassified astrovirus genomes for which incompatibility with the host species was observed, suggesting cross-species infection. Lastly, our machine learning-based approach, augmented by principal component analysis (PCA), provides evidence supporting the hypothesis of the existence of a human astrovirus (HAstV) subgenus of the genus Mamastrovirus and a goose astrovirus (GAstV) subgenus of the genus Avastrovirus. Overall, this multipronged machine learning approach provides a fast, reliable, and scalable method for predicting taxonomic labels, able to keep pace with emerging viruses and the exponential increase in the output of modern genome sequencing technologies.
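To make the idea of k-mer composition and a three-pronged decision concrete, the following is a minimal sketch rather than the authors' pipeline. The function names `kmer_frequencies` and `consensus_genus`, the default `k=3`, and the exact decision rule are illustrative assumptions; the abstract does not specify the value of k, the classifiers used, or how disagreements among the three prongs are resolved.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector is ordered lexicographically over the alphabet A, C, G, T.
    k=3 is an arbitrary illustrative default, not a value from the paper.
    """
    seq = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made of A/C/G/T are counted; windows containing
    # ambiguous bases (e.g. N) are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def consensus_genus(supervised, unsupervised, host_based):
    """Combine three genus predictions (hypothetical decision rule).

    Assumption: a label is proposed when both machine learning methods
    agree; disagreement with the host-based label is flagged as a
    possible cross-species infection instead of being discarded.
    """
    if supervised == unsupervised == host_based:
        return supervised, "agreement"
    if supervised == unsupervised:
        return supervised, "host mismatch"
    return None, "unresolved"
```

In a sketch like this, the frequency vectors would serve as input features for both the supervised and the unsupervised method, and `consensus_genus` would be applied once per genome to the three resulting labels.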