Division of Data Science and Learning, Argonne National Laboratory, Argonne, Illinois, United States of America.
Consortium for Advanced Science and Engineering, University of Chicago, Chicago, Illinois, United States of America.
PLoS Comput Biol. 2020 Oct 19;16(10):e1008319. doi: 10.1371/journal.pcbi.1008319. eCollection 2020 Oct.
A growing number of studies are using machine learning models to accurately predict antimicrobial resistance (AMR) phenotypes from bacterial sequence data. Although these studies are showing promise, the models are typically trained using features derived from comprehensive sets of AMR genes or whole genome sequences and may not be suitable for use when genomes are incomplete. In this study, we explore the possibility of predicting AMR phenotypes using incomplete genome sequence data. Models were built from small sets of randomly selected core genes after removing the AMR genes. For Klebsiella pneumoniae, Mycobacterium tuberculosis, Salmonella enterica, and Staphylococcus aureus, we report that it is possible to classify susceptible and resistant phenotypes with average F1 scores ranging from 0.80 to 0.89 using as few as 100 conserved non-AMR genes, with very major error rates ranging from 0.11 to 0.23 and major error rates ranging from 0.10 to 0.20. Models built from core genes have predictive power in cases where the primary AMR mechanisms result from SNPs or horizontal gene transfer. By randomly sampling non-overlapping sets of core genes, we show that F1 scores and error rates are stable and have little variance between replicates. Although these small core gene models have lower accuracies and higher error rates than models built from the corresponding assembled genomes, the results suggest that sufficient variation exists in the core non-AMR genes of a species for predicting AMR phenotypes.
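The evaluation quantities named in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not spell out: the very major error (VME) rate is taken as the fraction of truly resistant isolates predicted susceptible, the major error (ME) rate as the fraction of truly susceptible isolates predicted resistant, and F1 is computed for the resistant class only (the study may average F1 differently). The function names `amr_error_metrics` and `sample_disjoint_gene_sets`, the "R"/"S" labels, and the gene identifiers are hypothetical and chosen only for illustration; this is not the authors' pipeline.

```python
import random


def amr_error_metrics(y_true, y_pred):
    """Compute F1 (resistant class), VME rate, and ME rate.

    Labels are assumed to be "R" (resistant) or "S" (susceptible).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == "R" and p == "R" for t, p in pairs)  # resistant, called resistant
    fn = sum(t == "R" and p == "S" for t, p in pairs)  # very major errors
    fp = sum(t == "S" and p == "R" for t, p in pairs)  # major errors
    tn = sum(t == "S" and p == "S" for t, p in pairs)  # susceptible, called susceptible

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    vme_rate = fn / (tp + fn) if (tp + fn) else 0.0  # assumed denominator: all resistant
    me_rate = fp / (fp + tn) if (fp + tn) else 0.0   # assumed denominator: all susceptible
    return {"f1": f1, "vme_rate": vme_rate, "me_rate": me_rate}


def sample_disjoint_gene_sets(core_genes, set_size, n_sets, seed=0):
    """Draw n_sets random, non-overlapping gene sets of set_size genes each."""
    if set_size * n_sets > len(core_genes):
        raise ValueError("not enough core genes for the requested disjoint sets")
    rng = random.Random(seed)
    shuffled = list(core_genes)
    rng.shuffle(shuffled)
    # Consecutive slices of one shuffled list cannot share a gene.
    return [shuffled[i * set_size:(i + 1) * set_size] for i in range(n_sets)]


if __name__ == "__main__":
    truth = ["R", "R", "R", "R", "S", "S", "S", "S", "S", "S"]
    preds = ["R", "R", "R", "S", "S", "S", "S", "S", "R", "R"]
    print(amr_error_metrics(truth, preds))

    genes = [f"gene{i}" for i in range(1000)]  # placeholder identifiers
    replicates = sample_disjoint_gene_sets(genes, set_size=100, n_sets=10)
    print(len(replicates), len(replicates[0]))
```

Slicing a single shuffled list guarantees that replicate gene sets share no genes, which is one simple way to obtain the non-overlapping replicates whose score variance the abstract describes.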