Arjmand A, Bani-Yaghoub M, Sutkin G, Corkran K, Paschal S
Division of Computing, Analytics and Mathematics, School of Science and Engineering, University of Missouri-Kansas City, Kansas City, Kansas, USA.
J Hosp Infect. 2025 Aug;162:263-271. doi: 10.1016/j.jhin.2025.04.024. Epub 2025 May 6.
Differentiating between community-associated (CA-) and healthcare-associated (HA-) urinary tract infection (UTI) is crucial for understanding their epidemiology, identifying risk factors, and developing appropriate treatment strategies.
To build, validate, and compare five machine learning models (Decision Tree, Neural Network, Logistic Regression, Random Forest, and Extreme Gradient Boosting) for differentiating between incidences of HA-UTI and CA-UTI, and to identify key predictors of UTI using demographic, hospital, and socioeconomic variables.
Patient demographics, hospital, and socioeconomic data from 2019 to 2023 were analysed.
The Decision Tree model demonstrated the highest sensitivity (87%), particularly in handling the highly imbalanced healthcare-associated infection (HAI) data. Logistic Regression achieved the best overall accuracy, at 95.9% for distinguishing HA-UTI from UTI-free and 93.2% for HA-UTI vs CA-UTI. Random Forest performed best in cross-validation, reaching 99.1% for HA-UTI vs UTI-free and 96.2% for HA-UTI vs CA-UTI. Neural Network showed the highest specificity, at 93.4%, for HA-UTI vs CA-UTI. The area-under-the-curve (AUC) values further supported these findings, ranging from 71.9% for Neural Network to 94% for Random Forest, reflecting the robustness of these models across different annual datasets. Among patient demographic, hospital, and socioeconomic variables, all models consistently identified Nurse Units (e.g. Inpatient Units and Mental Health Units) as the most significant predictors of UTI. In addition to Nurse Units, Logistic Regression and Decision Tree identified location (e.g. various clinics and medical centres) as a key predictor.
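The metrics reported above (sensitivity, specificity, accuracy, and area under the curve) can be illustrated with a minimal Python sketch. The labels and scores below are invented toy values, not data from the study, and the function names are hypothetical. The example assumes a binary task in which 1 marks the rare class (e.g. HA-UTI) and 0 the comparison class, and it shows why accuracy alone can look good on imbalanced data while sensitivity stays low.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp: int, fn: int) -> float:
    """Share of true positives that the model detects (recall)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of true negatives that the model correctly rejects."""
    return tn / (tn + fp)


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count 0.5).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, imbalanced example (illustrative values only): 2 positives among 10 cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("accuracy:", accuracy(tp, fp, tn, fn))
print("AUC:", auc(y_true, scores))
```

In this toy case accuracy is 0.8 even though only one of the two positive cases is detected (sensitivity 0.5), which is why sensitivity and specificity are worth reporting separately when one class is rare.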
The machine learning models demonstrated comparable overall accuracy, but differed in sensitivity and specificity across the two classification tasks (HA-UTI vs CA-UTI and HA-UTI vs UTI-free). Nurse Units consistently emerged as the most significant predictors across all years. The importance of all predictors varied from year to year.