College of Biology, Hunan University, Changsha, China.
College of Computer Science and Electronic Engineering, Hunan University, Changsha, China.
Transbound Emerg Dis. 2019 Nov;66(6):2517-2522. doi: 10.1111/tbed.13314. Epub 2019 Aug 12.
Viruses have caused substantial human mortality and morbidity and pose a serious threat to global public health. The human-infecting virome is still far from completely characterized. With the rapid development of viral metagenomics, novel viruses are being discovered at an unprecedented pace. However, methods for rapidly identifying novel viruses with the potential to infect humans are still lacking. This study built several machine learning models to discriminate human-infecting viruses from other viruses based on the frequency of k-mers in viral genomic sequences. The k-nearest neighbor (KNN) model predicted human-infecting viruses with an accuracy of over 90%. The performance of the KNN model built on short contigs (≥1 kb) was comparable to that of models built on viral genomes. We further validated the KNN model on a reported human blood virome, where it achieved an accuracy of over 80% on very short raw reads (150 bp). Our work demonstrates a conceptual and generic protocol for discovering novel human-infecting viruses in viral metagenomics studies.
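The k-mer frequency approach described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the k-mer length, the distance metric, or the number of neighbors, so `k=4`, Euclidean distance, and `n_neighbors=3` are assumptions chosen only for illustration, and the function names `kmer_frequencies` and `knn_predict` and the labels `"human"` and `"other"` are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k=4):
    """Return the normalized frequency vector over all 4**k DNA k-mers.

    k=4 is an assumed value; the abstract does not give the k used.
    K-mers containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def knn_predict(train_vectors, train_labels, query, n_neighbors=3):
    """Classify `query` by majority vote among its nearest training vectors.

    Euclidean distance and n_neighbors=3 are assumptions for this sketch.
    """
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:n_neighbors])
    return votes.most_common(1)[0][0]
```

In use, each labeled genome, contig, or read would be converted with `kmer_frequencies` and the resulting vectors passed to `knn_predict` together with the vector of an unlabeled sequence. A practical pipeline would more likely rely on an established library classifier and cross-validation than on this brute-force search.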