Department of Data Science in Biomedicine, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg 35032, Germany.
Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen 35392, Germany.
Bioinformatics. 2022 Jan 3;38(2):325-334. doi: 10.1093/bioinformatics/btab681.
Antimicrobial resistance (AMR) is one of the biggest global problems threatening human and animal health. Rapid and accurate AMR diagnostic methods are therefore urgently needed. However, traditional antimicrobial susceptibility testing (AST) is time-consuming, low-throughput and viable only for cultivable bacteria. Machine learning methods may pave the way for automated AMR prediction based on bacterial genomic data. So far, however, a comparison of different machine learning methods for AMR prediction, based on different encodings of whole-genome sequencing data and without prior knowledge, has been lacking.
In this study, we evaluated logistic regression (LR), support vector machine (SVM), random forest (RF) and convolutional neural network (CNN) models for the prediction of AMR for the antibiotics ciprofloxacin, cefotaxime, ceftazidime and gentamicin. We demonstrate that these models can effectively predict AMR from whole-genome sequencing data using label encoding, one-hot encoding and frequency matrix chaos game representation (FCGR) encoding. We trained the models on a large AMR dataset and evaluated them on an independent public dataset. In general, RF and CNN performed better than LR and SVM, with AUCs of up to 0.96. Furthermore, we were able to identify mutations associated with AMR for each antibiotic.
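To make the three encodings named above more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The integer mapping in `label_encode`, the column order in `one_hot_encode`, the corner assignment in `CORNERS`, the k-mer length `k` and all function names are illustrative assumptions; the abstract does not specify these details, and the linked repository remains the authoritative source.

```python
# Illustrative sketch only: mappings, names and k are assumptions,
# not taken from the paper or its repository.

NUCLEOTIDES = "ACGT"

# Assumed corner assignment for the chaos game representation;
# conventions differ between publications.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def label_encode(seq):
    """Map A, C, G, T to 1..4; any other symbol (e.g. N) to 0."""
    return [NUCLEOTIDES.find(base) + 1 for base in seq.upper()]


def one_hot_encode(seq):
    """One row per base with columns A, C, G, T; unknown bases give all zeros."""
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq.upper()]


def fcgr(seq, k=3):
    """Count k-mers into a 2^k x 2^k frequency matrix.

    Each base contributes one bit to the row index and one bit to the
    column index of its k-mer, according to CORNERS. k-mers containing
    symbols other than A, C, G, T are skipped.
    """
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(base not in CORNERS for base in kmer):
            continue
        row = col = 0
        for base in kmer:
            r, c = CORNERS[base]
            row = 2 * row + r
            col = 2 * col + c
        matrix[row][col] += 1
    return matrix


if __name__ == "__main__":
    example = "ACGTN"
    print(label_encode(example))
    print(one_hot_encode(example))
    for line in fcgr("ACGTACGT", k=2):
        print(line)
```

Label and one-hot encodings keep positional information and grow with sequence length, whereas the FCGR matrix has a fixed size of 2^k by 2^k regardless of input length, which is one reason it pairs naturally with image-style models such as CNNs.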
Source code for data preparation and model training is available on GitHub (https://github.com/YunxiaoRen/ML-iAMR).
Supplementary data are available at Bioinformatics online.