Babirye Sandra Ruth, Nsubuga Mike, Mboowa Gerald, Batte Charles, Galiwango Ronald, Kateete David Patrick
Department of Immunology and Molecular Biology, School of Biomedical Sciences, College of Health Sciences, Makerere University, P.O. Box 7072, Kampala, Uganda.
The African Center of Excellence in Bioinformatics and Data-Intensive Science (ACE), Kampala, Uganda.
BMC Infect Dis. 2024 Dec 5;24(1):1391. doi: 10.1186/s12879-024-10282-7.
Efforts toward tuberculosis management and control are challenged by the emergence of Mycobacterium tuberculosis (MTB) resistance to existing anti-TB drugs. This study aimed to explore the potential of machine learning algorithms to predict MTB resistance to four anti-TB drugs (rifampicin, isoniazid, streptomycin, and ethambutol) using whole-genome sequence and clinical data from Uganda. We also assessed the models' generalizability on a second dataset from South Africa.
We trained ten machine learning algorithms on a dataset comprising 182 MTB isolates, with clinical variables (age, sex, HIV status) and SNP mutations across the entire genome as predictor variables and phenotypic drug-susceptibility data for the four drugs as the outcome variable. Model performance varied across the four anti-TB drugs after five-fold cross-validation. The best model for each drug was selected using the highest Matthews Correlation Coefficient (MCC) and Area Under the Receiver Operating Characteristic Curve (AUC) as key metrics. Logistic regression (LR) performed best for rifampicin (MCC: 0.83 (95% confidence interval (CI) 0.73-0.86); AUC: 0.96 (95% CI 0.95-0.98)) and streptomycin (MCC: 0.44 (95% CI 0.27-0.58); AUC: 0.80 (95% CI 0.74-0.82)), Extreme Gradient Boosting (XGBoost) for ethambutol (MCC: 0.65 (95% CI 0.54-0.74); AUC: 0.90 (95% CI 0.83-0.96)), and Gradient Boosting (GBC) for isoniazid (MCC: 0.69 (95% CI 0.61-0.78); AUC: 0.91 (95% CI 0.88-0.96)). The best-performing model per drug was trained on the SNP dataset only, excluding the clinical variables, because integrating them with SNP mutations yielded only a marginal improvement in model performance. Although the best models achieved MCC scores of 0.18 to 0.72 and AUC scores of 0.66 to 0.95 on the Uganda test dataset, the LR models for rifampicin and streptomycin generalized poorly to the South Africa dataset compared with the GBC and XGBoost models. Compared with TB-Profiler, the LR model for rifampicin (RIF) was highly sensitive, and the GBC model for isoniazid (INH) and the XGBoost model for ethambutol (EMB) were highly specific, on the Uganda dataset. TB-Profiler outperformed all the best models on the South Africa dataset. We identified key mutations associated with resistance to these antibiotics. HIV status was also among the top-ranked features for predicting drug resistance.
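The evaluation scheme described above (five-fold cross-validation scored by MCC and AUC) can be illustrated with a minimal sketch. This is not the authors' pipeline: the SNP matrix and resistance labels below are synthetic, the helper name `evaluate_cv`, the 50-SNP feature count, the 0.5 decision threshold, and the pooling of out-of-fold predictions are all assumptions made for illustration, and scikit-learn is assumed as the library.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def evaluate_cv(model, X, y, n_splits=5, seed=0):
    """Hypothetical helper: MCC and AUC from pooled out-of-fold predictions."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Each isolate is scored by a model that never saw it during training.
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    pred = (proba >= 0.5).astype(int)  # assumed decision threshold
    return matthews_corrcoef(y, pred), roc_auc_score(y, proba)


# Synthetic stand-in for a binary SNP presence/absence matrix (NOT real data).
rng = np.random.default_rng(0)
n_isolates, n_snps = 182, 50
X = rng.integers(0, 2, size=(n_isolates, n_snps))

# Synthetic phenotype: resistance driven by the first three SNPs plus noise.
signal = X[:, 0] + X[:, 1] + X[:, 2]
y = (signal + rng.normal(0.0, 0.5, n_isolates) >= 2).astype(int)

mcc, auc = evaluate_cv(LogisticRegression(max_iter=1000), X, y)
print(f"MCC={mcc:.2f}  AUC={auc:.2f}")
```

Swapping `LogisticRegression` for another classifier that exposes `predict_proba` would allow the same comparison across algorithms. MCC is a useful companion to AUC here because it stays informative when resistant and susceptible isolates are imbalanced.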
Machine learning for predicting antimicrobial resistance is a promising avenue for addressing the global health challenge that such resistance poses. This work demonstrates that integrating diverse data types, such as genomic and clinical data, could improve machine learning-based resistance predictions, support robust surveillance systems, and inform targeted interventions to curb the rising threat of antimicrobial resistance.