Subramanian Devika, Sonabend Rona, Singh Ila
Department of Computer Science, Rice University, Houston, TX, United States.
Department of Pediatrics, Baylor College of Medicine, Houston, TX, United States.
JMIR Diabetes. 2024 Aug 7;9:e53338. doi: 10.2196/53338.
Diabetic ketoacidosis (DKA) is the leading cause of morbidity and mortality in pediatric type 1 diabetes (T1D), occurring in approximately 20% of patients, with an economic cost of $5.1 billion/year in the United States. Although multiple risk factors for postdiagnosis DKA are known, there is still a need for explainable, clinic-ready models that accurately predict DKA hospitalization in established patients with pediatric T1D.
We aimed to develop an interpretable machine learning model to predict the risk of postdiagnosis DKA hospitalization in children with T1D using routinely collected time-series of electronic health record (EHR) data.
We conducted a retrospective case-control study using EHR data from 1787 of the 3794 patients with T1D treated at a large tertiary care US pediatric health system from January 2010 to June 2018. We trained a state-of-the-art, explainable, gradient-boosted ensemble of decision trees (XGBoost) with 44 regularly collected EHR features to predict postdiagnosis DKA. We measured the model's predictive performance using the area under the receiver operating characteristic curve, weighted F-score, weighted precision, and weighted recall in a 5-fold cross-validation setting. We analyzed Shapley values to interpret the learned model and gain insight into its predictions.
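As an illustration of the evaluation metrics named above, the following minimal sketch computes the area under the receiver operating characteristic curve and support-weighted precision, recall, and F-score using only the Python standard library. It assumes that "weighted" means averaging per-class scores by class frequency (the convention used by common machine learning toolkits); the abstract does not state the exact definition. The labels, scores, and 0.5 decision threshold are invented toy values, not study data.

```python
from collections import Counter


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_scores(y_true, y_pred):
    """Support-weighted average of per-class precision, recall, and F-score.

    Assumption: each class is weighted by its share of the true labels.
    """
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f_score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        prec = tp / predicted if predicted else 0.0
        rec = tp / n
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = n / total
        precision += weight * prec
        recall += weight * rec
        f_score += weight * f
    return precision, recall, f_score


# Toy example (invented values): 3 patients with DKA (1), 7 without (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical 0.5 threshold

auc = roc_auc(y_true, scores)
precision, recall, f_score = weighted_scores(y_true, y_pred)
print(f"AUC={auc:.3f} precision={precision:.2f} recall={recall:.2f} F={f_score:.2f}")
```

In a 5-fold cross-validation setting, these quantities would be computed once per held-out fold and summarized as a mean and SD across folds.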
Our model distinguished the cohort that develops DKA postdiagnosis from the one that does not (P<.001). It predicted postdiagnosis DKA risk with an area under the receiver operating characteristic curve of 0.80 (SD 0.04), a weighted F-score of 0.78 (SD 0.04), and a weighted precision and recall of 0.83 (SD 0.03) and 0.76 (SD 0.05), respectively, using a relatively short history of data from routine clinic follow-ups after diagnosis. By analyzing Shapley values of the model output, we identified key risk factors predicting postdiagnosis DKA at both the cohort and individual levels. We observed sharp changes in postdiagnosis DKA risk with respect to 2 key features (diabetes age and glycated hemoglobin at 12 months), yielding time intervals and glycated hemoglobin cutoffs for potential intervention. By clustering model-generated Shapley values, we automatically stratified the cohort into 3 groups with 5%, 20%, and 48% risk of postdiagnosis DKA.
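To make the notion of a per-patient Shapley value concrete, the sketch below computes exact Shapley values by enumerating feature coalitions for a toy 2-feature risk function. This is a brute-force illustration, not the method used in the study, which would rely on an efficient algorithm for tree ensembles. The function `toy_risk`, the feature names, the thresholds (9.0 and 2.0), and the patient and baseline values are all hypothetical and are not cutoffs reported by the authors.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input."""
    names = list(x)
    n = len(names)

    def value(coalition):
        # Features in the coalition take the patient's value;
        # all other features stay at the baseline value.
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_risk(f):
    """Hypothetical risk score with invented thresholds and an interaction term."""
    high_hba1c = f["hba1c_12mo"] > 9.0
    long_duration = f["diabetes_age_years"] > 2.0
    return (0.05 + 0.20 * high_hba1c + 0.10 * long_duration
            + 0.10 * (high_hba1c and long_duration))


patient = {"hba1c_12mo": 10.2, "diabetes_age_years": 3.5}   # invented patient
baseline = {"hba1c_12mo": 7.5, "diabetes_age_years": 1.0}   # invented reference

phi = shapley_values(toy_risk, patient, baseline)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.2f}")
```

The contributions sum to the difference between the patient's predicted risk and the baseline risk, which is what allows each prediction to be decomposed into per-feature risk factors. Vectors of such contributions, one per patient, are the kind of input that could then be clustered to stratify a cohort.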
We have built an explainable, predictive machine learning model with potential for integration into clinical workflow. The model risk-stratifies patients with pediatric T1D and identifies patients with the highest postdiagnosis DKA risk using limited follow-up data starting from the time of diagnosis. The model identifies key time points and risk factors to direct clinical interventions at both the individual and cohort levels. Further research with data from multiple hospital systems can help us assess how well our model generalizes to other populations. The clinical importance of our work is that the model can identify the patients most at risk for postdiagnosis DKA and point to preventive interventions based on mitigation of individualized risk factors.