Department of Population Medicine, Ontario Veterinary College, University of Guelph, 50 Stone Road East, Guelph, Ontario, Canada.
Department of Population Medicine, Ontario Veterinary College, University of Guelph, 50 Stone Road East, Guelph, Ontario, Canada; Centre for Public Health and Zoonoses, Ontario Veterinary College, University of Guelph, 50 Stone Road East, Guelph, Ontario, Canada; Centre for Advancing Responsible and Ethical Artificial Intelligence, University of Guelph, 50 Stone Road East, Guelph, Ontario, Canada.
Prev Vet Med. 2024 Dec;233:106351. doi: 10.1016/j.prevetmed.2024.106351. Epub 2024 Sep 26.
Influenza poses both a public health and an agricultural risk and has pandemic potential. Among the subtypes of influenza A virus, H3 influenza virus can infect many avian and mammalian species and is therefore of interest to both human and veterinary public health. The primary goal of this study was to train and validate classifiers that identify the most likely host species from the hemagglutinin gene segment of H3 viruses. A five-step process was implemented, which included training four machine learning classifiers, testing the classifiers on a validation dataset, and further exploring the best-performing model on three additional datasets. The gradient boosting machine classifier showed the highest host-classification accuracy, with a correct classification rate of 98.0% (95% CI [97.01, 98.73]) on an independent validation dataset. The classifications were further analyzed using the predicted probability score, which highlighted sequences of particular interest: sequences, whether correctly or incorrectly classified, that showed considerable predicted probability for multiple hosts. This demonstrated the potential of these classifiers for rapid sequence classification and for flagging sequences of interest. Additionally, the classifiers were tested on a separate swine dataset composed of H3N2 sequences from the United States of America dating from 1998 to 2003, and on a separate canine dataset composed of canine H3N2 sequences of avian origin. These two datasets were used to examine applications of predicted probability and host convergence over time. Lastly, the classifiers were applied to an independent dataset of environmental sequences to explore host identification for such sequences. These results show the potential of machine learning as a species-level host identification technique for viruses of unknown origin.
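The use of predicted probability scores to highlight sequences of interest can be illustrated with a minimal sketch. It assumes that a trained classifier has already produced a per-host probability for each sequence. The function name `flag_ambiguous`, the `runner_up_threshold` value of 0.2, and the example sequence IDs and probabilities are all hypothetical; the abstract does not state how "considerable predicted probability" was defined, so this is one possible reading rather than the authors' procedure.

```python
from typing import Dict, List, Tuple


def flag_ambiguous(
    probs: Dict[str, Dict[str, float]],
    runner_up_threshold: float = 0.2,  # assumed cutoff, not from the study
) -> List[Tuple[str, str, str, float]]:
    """Return (sequence_id, top_host, runner_up_host, runner_up_probability)
    for every sequence whose second most probable host reaches the threshold."""
    flagged = []
    for seq_id, host_probs in probs.items():
        if len(host_probs) < 2:
            continue  # a runner-up host requires at least two candidate hosts
        # Rank hosts from most to least probable.
        ranked = sorted(host_probs.items(), key=lambda kv: kv[1], reverse=True)
        (top_host, _), (second_host, second_p) = ranked[0], ranked[1]
        if second_p >= runner_up_threshold:
            flagged.append((seq_id, top_host, second_host, second_p))
    return flagged


# Invented probabilities for illustration only; not data from the study.
example = {
    "seq_A": {"avian": 0.97, "swine": 0.02, "human": 0.01},
    "seq_B": {"avian": 0.55, "canine": 0.40, "swine": 0.05},
}

for seq_id, top, second, p in flag_ambiguous(example):
    print(f"{seq_id}: predicted {top}, but {second} has probability {p:.2f}")
```

A sequence is flagged regardless of whether its top-ranked host is correct, which matches the abstract's point that both correctly and incorrectly classified sequences can be of interest.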