Cabanillas Silva Patricia, Sun Hong, Rezk Mohamed, Roccaro-Waldmeyer Diana M, Fliegenschmidt Janis, Hulde Nikolai, von Dossow Vera, Meesseman Laurent, Depraetere Kristof, Stieg Joerg, Szymanowsky Ralph, Dahlweid Fried-Michael
Dedalus HealthCare, Antwerp, Belgium.
Provincial Key Laboratory of Multimodal Perceiving and Intelligent Systems, Jiaxing University, Jiaxing, China.
J Med Internet Res. 2024 Dec 13;26:e51409. doi: 10.2196/51409.
BACKGROUND: In recent years, machine learning (ML)-based models have been widely used in clinical domains to predict clinical risk events. However, in production, the performance of such models is highly sensitive to changes in the system and data. The dynamic nature of the system environment, characterized by continuous changes, has significant implications for prediction models, leading to performance degradation and reduced clinical efficacy. Thus, monitoring model shifts and evaluating their impact on prediction models are of utmost importance.
OBJECTIVE: This study aimed to assess the impact of a model shift on ML-based prediction models by evaluating 3 different use cases (delirium, sepsis, and acute kidney injury [AKI]) from 2 hospitals (M and H) with different patient populations and to investigate potential model deterioration during the COVID-19 pandemic period.
METHODS: We trained prediction models using retrospective data from earlier years and examined the presence of a model shift using data from more recent years. We used the area under the receiver operating characteristic curve (AUROC) to evaluate model performance and analyzed the calibration curves over time. We also assessed the influence on clinical decisions by evaluating the alert rate, the rates of over- and underdiagnosis, and the decision curve.
RESULTS: The 2 data sets used in this study contained 189,775 and 180,976 medical cases for hospitals M and H, respectively. Statistical analyses (Z test) revealed no significant difference (P>.05) between the AUROCs from the different years for all use cases and hospitals. For example, in hospital M, AKI did not show a significant difference between 2020 (AUROC=0.898) and 2021 (AUROC=0.907, Z=-1.171, P=.242). Similar results were observed in both hospitals and for the other use cases (sepsis and delirium) when comparing all the different years. However, when evaluating the calibration curves at the 2 hospitals, model shifts were observed for the delirium and sepsis use cases but not for AKI. Additionally, to investigate the clinical utility of our models, we performed decision curve analysis (DCA) and compared the results across the different years. A pairwise nonparametric statistical comparison showed no differences in the net benefit at the probability thresholds of interest (P>.05). The comprehensive evaluations performed in this study indicated robust performance of all the investigated models across the years. Moreover, neither performance deteriorations nor alert surges were observed during the COVID-19 pandemic period.
CONCLUSIONS: Clinical risk prediction models were affected by the dynamic and continuous evolution of clinical practices and workflows. The performance of the models evaluated in this study appeared stable when assessed using AUROCs, showing no significant variations over the years. Additional model shift investigations suggested that a calibration shift was present for certain use cases (delirium and sepsis). However, based on DCA, these changes did not have any impact on the clinical utility of the models. Consequently, it is crucial to closely monitor data changes and detect possible model shifts, along with their potential influence on clinical decision-making.
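The two quantities the abstract relies on, a Z test between yearly AUROCs and the net benefit used in DCA, can be illustrated with a minimal Python sketch. The abstract does not say how the standard error of each AUROC was estimated, so the Hanley-McNeil approximation and the assumption of independent yearly samples below are illustrative choices, not the authors' method. All function names are hypothetical. The net benefit uses the standard DCA definition, TP/N - (FP/N) x pt/(1 - pt), where pt is the probability threshold. As a consistency check, the two-sided P value for the reported Z=-1.171 comes out at about .242, matching the abstract.

```python
import math


def two_sided_p(z):
    """Two-sided P value for a standard normal Z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUROC (Hanley-McNeil formula).

    Assumption: this estimator is chosen for illustration only; the
    abstract does not state which one the study used.
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(var)


def auroc_z_test(auc_a, n_pos_a, n_neg_a, auc_b, n_pos_b, n_neg_b):
    """Z test for two AUROCs, assuming the two samples are independent."""
    se_a = hanley_mcneil_se(auc_a, n_pos_a, n_neg_a)
    se_b = hanley_mcneil_se(auc_b, n_pos_b, n_neg_b)
    z = (auc_a - auc_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return z, two_sided_p(z)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of alerting whenever y_prob >= threshold (0 < threshold < 1)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)
```

Evaluating `net_benefit` over a grid of thresholds for each year gives the decision curves that the study compares; the case counts needed by `auroc_z_test` are not given in the abstract and would have to come from the data.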