Bias or biology? Importance of model interpretation in machine learning studies from electronic health records.

Author information

Momenzadeh Amanda, Shamsa Ali, Meyer Jesse G

Affiliation

Department of Biochemistry, Medical College of Wisconsin, Milwaukee, Wisconsin, USA.

Publication information

JAMIA Open. 2022 Aug 8;5(3):ooac063. doi: 10.1093/jamiaopen/ooac063. eCollection 2022 Oct.

Abstract

OBJECTIVE

The rate of diabetic complication progression varies across individuals and understanding factors that alter the rate of complication progression may uncover new clinical interventions for personalized diabetes management.

MATERIALS AND METHODS

We explore how various machine learning (ML) models and types of electronic health records (EHRs) can predict fast versus slow onset of neuropathy, nephropathy, ocular disease, or cardiovascular disease using only patient data collected prior to diabetes diagnosis.

RESULTS

We find that optimized random forest models best predicted the diagnosis of a diabetic complication, with the most effective model distinguishing between fast and slow nephropathy (AUROC = 0.75). Combining all data sets yielded the highest predictive performance, and among individual data types, social history or laboratory data alone were most predictive. SHapley Additive exPlanations (SHAP) model interpretation allowed for exploration of predictors of fast and slow complication diagnosis, including underlying biases present in the EHR. Patients in the fast group had more medical visits, incurring a potential informed decision bias.
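To make the reported metric concrete, here is a minimal sketch of how an AUROC is computed for a binary fast-versus-slow classifier. It is not the authors' code: the function name `auroc`, the labels, and the scores are hypothetical toy values chosen only for illustration. The sketch uses the pairwise definition of AUROC, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data (not from the study): 1 = fast onset, 0 = slow onset.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.6, 0.3, 0.8, 0.4, 0.2, 0.1]

toy_auroc = auroc(labels, scores)
print(toy_auroc)  # 12 of 16 pairs are ranked correctly, so this prints 0.75
```

On this toy data the value happens to equal 0.75. An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value of 0.75 means that three out of four fast/slow pairs are ordered correctly by the model's score.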

DISCUSSION

Our study is unique in the realm of ML studies as it leverages SHAP as a starting point to explore patient markers not routinely used in diabetes monitoring. A mix of bias and biological processes likely influences a model's ability to distinguish between groups.
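The idea behind SHAP attributions can be illustrated with a small sketch that computes exact Shapley values by enumerating feature subsets. This is not the `shap` package and not the authors' pipeline: the model `toy_model`, the feature names `visit_count` and `lab_value`, and all numbers are hypothetical. Features outside a subset are set to a baseline value, which is one simple assumption among several possible ways to represent a "missing" feature.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict


def shapley_values(
    model: Callable[[Dict[str, float]], float],
    x: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Exact Shapley values for one instance x relative to a baseline instance."""
    names = list(x)
    n = len(names)

    def value(subset: set) -> float:
        # Features in the subset take the instance's value, the rest the baseline's.
        inputs = {k: (x[k] if k in subset else baseline[k]) for k in names}
        return model(inputs)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size that exclude `name`.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_model(f: Dict[str, float]) -> float:
    # Hypothetical risk score with an interaction term (illustration only).
    return 0.05 * f["visit_count"] + 0.1 * f["lab_value"] \
        + 0.01 * f["visit_count"] * f["lab_value"]


patient = {"visit_count": 10.0, "lab_value": 8.0}
baseline = {"visit_count": 2.0, "lab_value": 5.0}

phi = shapley_values(toy_model, patient, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.2f}")
```

The attributions sum to the difference between the model output for the patient and for the baseline. In this toy setup a utilization feature such as a visit count receives a large share of the attribution, which is the kind of pattern that would prompt a check for bias rather than biology. Exact enumeration scales exponentially with the number of features, so it is only practical for very small examples.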

CONCLUSION

Overall, model interpretation is a critical step in evaluating the validity of a user-intended endpoint for a model when using EHR data, and predictors affected by bias and those driven by biologic processes should be equally recognized.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f5c3/9360778/cfaecb86f091/ooac063f1.jpg
