Gregory Megan E, Kasthurirathne Suranga N, Magoc Tanja, McNamee Cassidy, Harle Christopher A, Vest Joshua R
Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, United States.
Center for Biomedical Informatics, Regenstrief Institute, Indianapolis, IN 46202, United States.
JAMIA Open. 2025 Jan 7;8(1):ooae150. doi: 10.1093/jamiaopen/ooae150. eCollection 2025 Feb.
Measurement of health-related social needs (HRSNs) is complex. We sought to develop and validate computable phenotypes (CPs) using structured electronic health record (EHR) data for food insecurity, housing instability, financial insecurity, transportation barriers, and a composite-type measure of these, using human-defined rule-based and machine learning (ML) classifier approaches.
We collected HRSN surveys as the reference standard and obtained EHR data from 1550 patients in 3 health systems from 2 states. We followed a Delphi-like approach to develop the human-defined rule-based CP. For the ML classifier approach, we trained supervised ML (XGBoost) models using 78 features. Using surveys as the reference standard, we calculated sensitivity, specificity, positive predictive values, and area under the curve (AUC). We compared AUCs using the DeLong test and other performance measures using McNemar's test, and checked for differential performance across patient subgroups.
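As an illustration of the performance measures named above, the following is a minimal sketch, not the authors' code. It computes sensitivity, specificity, and positive predictive value against a reference standard, a rank-based AUC, and an exact McNemar test on paired per-patient correctness. All data values, variable names, and the 0.5 decision threshold are hypothetical; the exact (binomial) form of McNemar's test is an assumption, since the abstract does not say which variant was used or how the pairs were formed.

```python
from math import comb

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def sens_spec_ppv(y_true, y_pred):
    """Sensitivity, specificity, PPV; assumes no denominator is zero."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), tn / (tn + fp), tp / (tp + fp)

def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired correct/incorrect flags."""
    b = sum(1 for a, c in zip(correct_a, correct_b) if a and not c)
    c = sum(1 for a, c_ in zip(correct_a, correct_b) if c_ and not a)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical toy data: survey-reported need (reference standard), a rule-based
# CP flag, and ML classifier scores for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
rule_pred = [1, 0, 0, 0, 1, 0, 1, 0]
ml_scores = [0.9, 0.7, 0.4, 0.3, 0.6, 0.2, 0.8, 0.1]
ml_pred = [int(s >= 0.5) for s in ml_scores]  # 0.5 threshold is an assumption

rule_metrics = sens_spec_ppv(y_true, rule_pred)
ml_metrics = sens_spec_ppv(y_true, ml_pred)
ml_auc = auc(y_true, ml_scores)
p_value = mcnemar_exact(
    [int(t == p) for t, p in zip(y_true, rule_pred)],
    [int(t == p) for t, p in zip(y_true, ml_pred)],
)
print(rule_metrics, ml_metrics, ml_auc, p_value)
```

The DeLong comparison of correlated AUCs is omitted here because it needs a covariance estimate that does not fit in a short sketch.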
Most patients (63%) reported at least one HRSN on the reference standard survey. Human-defined rule-based CPs exhibited poor performance (AUCs = .52 to .68). ML classifier CPs performed significantly better, but their performance was still poor to fair (AUCs = .68 to .75). ML classifier CPs differed significantly in performance by race/ethnicity (higher AUCs for White non-Hispanic patients). Important features included number of encounters and Medicaid insurance.
Using a supervised ML classifier approach, HRSN CPs approached thresholds of fair performance, but exhibited differential performance by race/ethnicity.
CPs may help to identify patients who may benefit from additional social needs screening. Future work should explore the use of area-level features via geospatial data and natural language processing to improve model performance.