Savalli Carine, Wichmann Roberta Moreira, Filho Fabiano Barcellos, Fernandes Fernando Timoteo, Filho Alexandre Dias Porto Chiavegatto
Federal University of São Paulo, Department of Public Politics and Public Health, Santos, Brazil.
School of Public Health, University of São Paulo, São Paulo, Brazil.
PLOS Digit Health. 2024 Dec 26;3(12):e0000699. doi: 10.1371/journal.pdig.0000699. eCollection 2024 Dec.
Machine learning (ML) is a promising tool for assisting clinical decision-making to improve diagnosis and prognosis, especially in developing regions. It is often used with large samples that aggregate data from different regions and hospitals. However, it is unclear how this aggregation affects predictions in local centers. This study compares strategies that aggregate data from several hospitals in Brazil with a local training strategy in each hospital to predict two COVID-19 outcomes: intensive care unit (ICU) admission and mechanical ventilation (MV) use. The study included 6,046 patients from 14 hospitals, with local sample sizes ranging from 47 to 1,500 patients. Machine learning models for structured data were trained using extreme gradient boosting, LightGBM, and CatBoost. Seven data aggregation strategies based on hospital geographic regions were compared with local training, and the best strategy was determined by the area under the ROC curve (AUROC). SHAP (SHapley Additive exPlanations) values were used to assess the contribution of variables to predictions. Additionally, a metafeatures analysis examined how hospital characteristics influence the selection of the best strategy. Local training was the most effective strategy for 11 of the 14 hospitals (79%) for ICU admission and for 10 of the 14 hospitals (71%) for MV use. The metafeatures analysis suggested that hospitals with smaller sample sizes generally performed better with an aggregated data strategy than with local training. Our study brings to light an important concern about the impact of grouping data from different hospitals in predictive machine learning models. These findings contribute to the ongoing debate about the trade-off between increasing sample size and bringing together heterogeneous scenarios.
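The strategy comparison described in the abstract (picking, for each hospital, the training strategy with the highest AUROC on that hospital's test data) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `auroc` and `best_strategy`, the strategy labels, and all scores and outcomes are hypothetical, and the AUROC is computed with the standard rank-based (Mann-Whitney) formulation using only the Python standard library.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def best_strategy(y_true, scores_by_strategy):
    """Return the strategy name with the highest AUROC, plus all AUROCs.

    On an exact tie, the strategy listed first in the dict wins.
    """
    results = {
        name: auroc(y_true, scores)
        for name, scores in scores_by_strategy.items()
    }
    best = max(results, key=results.get)
    return best, results


# Hypothetical test-set outcomes for one hospital (1 = ICU admission).
y_test = [1, 0, 1, 0, 0, 1]

# Hypothetical predicted probabilities from models trained under two strategies.
scores = {
    "local": [0.9, 0.2, 0.7, 0.4, 0.1, 0.6],
    "aggregated_all": [0.8, 0.5, 0.4, 0.6, 0.2, 0.7],
}

best, results = best_strategy(y_test, scores)
print(best, results)
```

In this toy example the local model separates the two classes perfectly, so it is selected. Repeating the comparison per hospital and per outcome would yield counts of the kind reported in the abstract; the pairwise AUROC computation here is quadratic in the number of cases and is meant only for illustration.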