BACKGROUND

Machine learning algorithms are currently used in a wide array of clinical domains to produce models that can predict clinical risk events. Most models are developed and evaluated with retrospective data; very few are evaluated in a clinical workflow, and even fewer report performance across different hospitals. In this study, we provide detailed evaluations of clinical risk prediction models in live clinical workflows for three different use cases in three different hospitals.

OBJECTIVE

The main objective of this study was to evaluate clinical risk prediction models in live clinical workflows and compare their performance in these settings with their performance on retrospective data. We also aimed to generalize the results by applying our investigation to three different use cases in three different hospitals.

METHODS

We trained clinical risk prediction models for three use cases (ie, delirium, sepsis, and acute kidney injury) in three different hospitals with retrospective data. We used machine learning and, specifically, deep learning to train models that were based on the Transformer model. The models were trained using a calibration tool that is common for all hospitals and use cases. The models had a common design but were calibrated using each hospital's specific data. The models were deployed in these three hospitals and used in daily clinical practice. The predictions made by these models were logged and correlated with the diagnosis at discharge. We compared their performance with evaluations on retrospective data and conducted cross-hospital evaluations.
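
The cross-hospital evaluation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy single-feature "models", the hospital names `A` and `B`, and the helper names `auroc` and `cross_hospital_auroc` are all hypothetical. It only assumes that each hospital has its own calibrated model and its own labeled data, and that every model is scored on every hospital's data.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive case is scored
    higher than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical types: a model maps one feature value to a risk score,
# and a dataset is (feature values, diagnosis labels at discharge).
Model = Callable[[float], float]
Dataset = Tuple[List[float], List[int]]


def cross_hospital_auroc(
    models: Dict[str, Model], data: Dict[str, Dataset]
) -> Dict[Tuple[str, str], float]:
    """Score every hospital's model on every hospital's data.
    Keys are (hospital the model was calibrated at, hospital evaluated at)."""
    results: Dict[Tuple[str, str], float] = {}
    for trained_at, model in models.items():
        for evaluated_at, (features, labels) in data.items():
            scores = [model(x) for x in features]
            results[(trained_at, evaluated_at)] = auroc(labels, scores)
    return results


# Deliberately exaggerated toy data: the feature indicates risk in
# hospital A but has the opposite meaning in hospital B.
toy_models: Dict[str, Model] = {"A": lambda x: x, "B": lambda x: -x}
toy_data: Dict[str, Dataset] = {
    "A": ([0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1]),
    "B": ([0.1, 0.4, 0.6, 0.9], [1, 1, 0, 0]),
}
matrix = cross_hospital_auroc(toy_models, toy_data)
```

In this toy setup the diagonal entries (model evaluated at its own hospital) are perfect and the off-diagonal entries collapse, which is a caricature of the degradation a cross-hospital evaluation is designed to detect.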

RESULTS

The performance of the prediction models with data from live clinical workflows was similar to their performance with retrospective data. The average area under the receiver operating characteristic curve (AUROC) decreased slightly, by 0.6 percentage points (from 94.8% to 94.2% at discharge). The cross-hospital evaluations showed severely reduced performance: the average AUROC decreased by approximately 8 percentage points (from 94.2% to 86.3% at discharge), which indicates the importance of calibrating the model with data from the deployment hospital.
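
As a quick check of the reported differences, the arithmetic can be written out using only the averages stated above. The symbols $\Delta_{\text{live}}$ and $\Delta_{\text{cross}}$ are introduced here for illustration, and because the published averages are rounded, the unrounded differences may deviate slightly.

```latex
\begin{align*}
\Delta_{\text{live}}  &= 94.8\% - 94.2\% = 0.6 \text{ percentage points} \\
\Delta_{\text{cross}} &= 94.2\% - 86.3\% = 7.9 \text{ percentage points} \approx 8 \text{ percentage points}
\end{align*}
```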

CONCLUSIONS

Calibrating the prediction models with data from each deployment hospital led to good performance in live settings. The performance degradation in the cross-hospital evaluation highlights the limitations of developing a single generic model for different hospitals. Designing a generic model development process that generates a specialized prediction model for each hospital helps ensure model performance across hospitals.
