Ryu Alexander J, Romero-Brufau Santiago, Qian Ray, Heaton Heather A, Nestler David M, Ayanian Shant, Kingsley Thomas C
Division of Hospital Internal Medicine, Mayo Clinic, Rochester, MN.
Department of Medicine, Mayo Clinic, Rochester, MN.
Mayo Clin Proc Innov Qual Outcomes. 2022 Apr 26;6(3):193-199. doi: 10.1016/j.mayocpiqo.2022.03.003. eCollection 2022 Jun.
To assess the generalizability of a clinical machine learning algorithm across multiple emergency departments (EDs).
We obtained data on all ED visits at our health care system's largest ED from May 5, 2018, to December 31, 2019. We also obtained data from 3 satellite EDs and 1 distant-hub ED from May 1, 2018, to December 31, 2018. A gradient-boosted machine model was trained on pooled data from the included EDs. To control for the effect of differing training set sizes, the pooled data were then randomly downsampled so that each site contributed as many visits as our smallest ED, and a second model was trained on these downsampled, pooled data. The models' performance was compared using the area under the receiver operating characteristic curve (AUC). Finally, site-specific models were trained and tested across all the sites, and feature importances were examined to understand the reasons for differing generalizability.
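For a concrete picture of this workflow, the sketch below mirrors it in Python with pandas and scikit-learn. It is a minimal illustration under stated assumptions, not the study's actual pipeline: the DataFrame `visits` and the column names `site` and `admitted` are hypothetical stand-ins, features are assumed to be numeric, and the abstract does not specify the gradient-boosted implementation, hyperparameters, or split design used.

```python
# Minimal sketch of the pooled-then-downsampled training comparison.
# `visits`, "site", and "admitted" are hypothetical names, not from the study.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_and_evaluate(df: pd.DataFrame) -> dict:
    """Train one gradient-boosted model on pooled data; report per-site AUC."""
    X = df.drop(columns=["site", "admitted"])  # assumes numeric features
    y = df["admitted"]
    X_tr, X_te, y_tr, y_te, site_tr, site_te = train_test_split(
        X, y, df["site"], test_size=0.2, stratify=y, random_state=0
    )
    model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]
    # AUC computed separately for each ED site within the held-out set.
    return {
        s: roc_auc_score(y_te[site_te == s], scores[site_te == s])
        for s in site_te.unique()
    }

# First model: pooled data from all sites.
# auc_pooled = train_and_evaluate(visits)
# Second model: pooled data randomly downsampled so every site contributes
# as many visits as the smallest ED, removing the effect of training set size.
# n_min = visits["site"].value_counts().min()
# downsampled = visits.groupby("site").sample(n=n_min, random_state=0)
# auc_downsampled = train_and_evaluate(downsampled)
```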
The training data sets contained 1918 to 64,161 ED visits. The AUC for the pooled model ranged from 0.84 to 0.94 across the sites; performance decreased slightly when the training sets were downsampled to match the size of our smallest ED site. When site-specific models were trained and tested across all the sites, the AUCs ranged more widely, from 0.71 to 0.93. Within a single ED site, the performance of the 5 site-specific models was most variable for our largest and smallest EDs. Finally, when feature importances were examined, several features were common to all site-specific models; however, the weights of these features differed across sites.
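The cross-site experiment behind these numbers can be sketched the same way: one model per site, each scored on every site, with feature importances compared side by side. As before, `visits`, `site`, and `admitted` are assumed names; for brevity, each model is scored on a site's full data rather than the held-out test sets the study would have used, so diagonal entries are optimistic.

```python
# Hedged sketch of the site-specific cross-evaluation and importance comparison.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def cross_site_comparison(df: pd.DataFrame):
    feats = df.columns.difference(["site", "admitted"])
    sites = sorted(df["site"].unique())
    # One gradient-boosted model fit on each site's visits alone.
    models = {
        s: GradientBoostingClassifier(random_state=0).fit(
            df.loc[df["site"] == s, feats], df.loc[df["site"] == s, "admitted"]
        )
        for s in sites
    }
    # AUC matrix: rows = training site, columns = test site. Diagonal entries
    # reuse training data here, so a real evaluation would hold out a test set.
    auc = pd.DataFrame(index=sites, columns=sites, dtype=float)
    for tr in sites:
        for te in sites:
            d = df[df["site"] == te]
            auc.loc[tr, te] = roc_auc_score(
                d["admitted"], models[tr].predict_proba(d[feats])[:, 1]
            )
    # Feature importances per site-specific model, to see which predictors
    # are shared across sites and how their weights differ.
    imp = pd.DataFrame(
        {s: m.feature_importances_ for s, m in models.items()}, index=feats
    )
    return auc, imp

# auc_matrix, importances = cross_site_comparison(visits)
```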
A machine learning model for predicting hospital admission from the ED will generalize fairly well within the health care system but will still show significant differences in AUC across sites because of site-specific factors.