Ares Genetics GmbH, Vienna, Austria.
Division of Computational Systems Biology, Department of Microbiology and Ecosystem Science, University of Vienna, Vienna, Austria.
Front Cell Infect Microbiol. 2021 Feb 15;11:610348. doi: 10.3389/fcimb.2021.610348. eCollection 2021.
Antimicrobial resistance prediction from whole genome sequencing (WGS) data is an emerging application of machine learning, promising to improve antimicrobial resistance surveillance and outbreak monitoring. Despite significant reductions in sequencing cost, the availability and sampling diversity of WGS data with matched antimicrobial susceptibility testing (AST) profiles required for training WGS-AST prediction models remain limited. Best-practice machine learning techniques are required to ensure that trained models generalize to independent data for optimal predictive performance. Limited data restricts the choice of machine learning training and evaluation methods and can result in overestimation of model performance. We demonstrate that the widely used random k-fold cross-validation method is ill-suited for application to small bacterial genomics datasets and offer an alternative cross-validation method based on genomic distance. We benchmarked three machine learning architectures previously applied to the WGS-AST problem on a set of 8,704 genome assemblies from five clinically relevant pathogens across 77 species-compound combinations collated from public databases. We show that individual models can be effectively ensembled to improve model performance. By combining models via stacked generalization with cross-validation, a model ensembling technique suitable for small datasets, we improved average sensitivity and specificity of individual models by 1.77% and 3.20%, respectively. Furthermore, stacked models exhibited improved robustness and were thus less prone to outlier performance drops than individual component models. In this study, we highlight best-practice techniques for antimicrobial resistance prediction from WGS data and introduce the combination of genome-distance-aware cross-validation and stacked generalization for robust and accurate WGS-AST.
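The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how genomic distance is turned into folds, so the single-linkage clustering, the 0.01 threshold, the toy distance matrix, the toy base learners, and all function names below are assumptions chosen for illustration. The idea shown is that near-identical genomes are kept in the same fold, so a model is never evaluated on a close relative of a training genome, and that the same folds yield the out-of-fold predictions on which a stacked meta-model would be trained.

```python
from collections import defaultdict


def distance_clusters(dist, threshold):
    """Single-linkage clustering: genomes closer than `threshold` share a cluster."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def grouped_folds(cluster_ids, k):
    """Assign whole clusters to k folds, largest cluster first, smallest fold first."""
    members = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        members[cluster].append(idx)
    folds = [[] for _ in range(k)]
    for group in sorted(members.values(), key=len, reverse=True):
        min(folds, key=len).extend(group)
    return folds


def out_of_fold_predictions(folds, X, y, base_learners):
    """Stacking features: each sample is scored only by models that never saw its fold."""
    n = len(y)
    meta = [[None] * len(base_learners) for _ in range(n)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        for m, fit in enumerate(base_learners):
            predict = fit([X[i] for i in train], [y[i] for i in train])
            for i in held_out:
                meta[i][m] = predict(X[i])
    return meta


# Hypothetical base learners: each takes training data and returns a predict function.
def fit_majority(X_train, y_train):
    label = int(2 * sum(y_train) >= len(y_train))
    return lambda x: label


def fit_marker_rate(X_train, y_train):
    # Resistance rate among training genomes carrying a (hypothetical) marker in feature 0.
    with_marker = [label for x, label in zip(X_train, y_train) if x[0] == 1]
    rate = sum(with_marker) / len(with_marker) if with_marker else 0.0
    return lambda x: rate if x[0] == 1 else 0.0


# Toy data (assumed): six genomes forming three near-clonal pairs.
n = 6
close_pairs = {(0, 1), (2, 3), (4, 5)}
dist = [
    [0.0 if i == j else (0.001 if (min(i, j), max(i, j)) in close_pairs else 0.05)
     for j in range(n)]
    for i in range(n)
]
X = [[1], [1], [0], [0], [1], [0]]  # presence/absence of one marker
y = [1, 1, 0, 0, 1, 0]              # 1 = resistant, 0 = susceptible

clusters = distance_clusters(dist, threshold=0.01)
folds = grouped_folds(clusters, k=3)
meta = out_of_fold_predictions(folds, X, y, [fit_majority, fit_marker_rate])
```

In a full pipeline, a meta-model would then be fit on the rows of `meta` against `y`, and the whole procedure would itself be evaluated on folds that respect the same genome clusters.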