Health Informatics Program, Department of Health Administration and Policy, George Mason University, Fairfax, Virginia.
Department of Population Health, New York University School of Medicine, New York, New York.
Big Data. 2018 Sep 1;6(3):214-224. doi: 10.1089/big.2018.0002. Epub 2018 Sep 19.
Existing methods of screening for substance abuse (standardized questionnaires or clinicians simply asking) have proven difficult to initiate and maintain in primary care settings. This article reports on how predictive modeling can be used to screen for substance abuse using extant data in electronic health records (EHRs). We relied on data available through the Veterans Affairs Informatics and Computing Infrastructure (VINCI) for the years 2006 through 2016. We focused on 4,681,809 veterans who had at least two primary care visits, of whom 829,827 had a hospitalization. Data included 699 million outpatient and 17 million inpatient records. The dependent variable was substance abuse as identified from 89 diagnostic codes using the Agency for Healthcare Research and Quality classification of diseases. In addition, we included the diagnostic codes used for identification of prescription abuse. The independent variables were 10,292 inpatient and 13,512 outpatient diagnoses, plus 71 dummy variables indicating each year of age from 20 to 90. A modified naive Bayes model was used to aggregate the risk across predictors. The accuracy of the predictions was examined using the area under the receiver operating characteristic (AROC) curve in a randomly selected 20% of the data set aside for evaluation. Many physical and mental illnesses were associated with substance abuse. These associations supported findings reported in the literature regarding the impact of substance abuse on various diseases and vice versa. In the randomly set-aside validation data, the model accurately predicted substance abuse for inpatient (AROC = 0.884), outpatient (AROC = 0.825), and combined inpatient and outpatient (AROC = 0.840) data. When information recorded after substance abuse became known was excluded, the cross-validated AROC remained high: 0.822 for inpatient and 0.817 for outpatient data. Data within EHRs can be used to detect existing substance abuse or predict potential future substance abuse.
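The general idea of aggregating risk across binary predictors with naive Bayes, then checking discrimination on a held-out 20%, can be illustrated with a minimal sketch. The abstract does not describe how the authors modified naive Bayes, so the code below is a plain naive Bayes with Laplace smoothing, not their model. The data are synthetic, and every name and parameter (`simulate`, `fit_log_likelihood_ratios`, `score`, `auc`, the five predictors and their probabilities) is a hypothetical chosen for illustration. Each patient's log odds are the prior log odds plus the sum of per-predictor log likelihood ratios.

```python
import math
import random


def simulate(n, rng):
    """Synthetic patients: 5 binary 'diagnoses' and a binary outcome (all values assumed)."""
    p_given_pos = [0.6, 0.5, 0.3, 0.2, 0.1]  # P(diagnosis present | outcome = 1)
    p_given_neg = [0.2, 0.2, 0.3, 0.1, 0.1]  # P(diagnosis present | outcome = 0)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.2 else 0
        probs = p_given_pos if label else p_given_neg
        X.append([1 if rng.random() < p else 0 for p in probs])
        y.append(label)
    return X, y


def fit_log_likelihood_ratios(X, y, alpha=1.0):
    """Estimate prior log odds and per-predictor log likelihood ratios (Laplace-smoothed)."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    present, absent = [], []
    for j in range(len(X[0])):
        pos_with = sum(row[j] for row, label in zip(X, y) if label == 1)
        neg_with = sum(row[j] for row, label in zip(X, y) if label == 0)
        p_pos = (pos_with + alpha) / (n_pos + 2 * alpha)
        p_neg = (neg_with + alpha) / (n_neg + 2 * alpha)
        present.append(math.log(p_pos / p_neg))
        absent.append(math.log((1 - p_pos) / (1 - p_neg)))
    prior_log_odds = math.log((n_pos + alpha) / (n_neg + alpha))
    return prior_log_odds, present, absent


def score(row, model):
    """Naive Bayes log odds: prior plus the sum of log likelihood ratios."""
    prior_log_odds, present, absent = model
    return prior_log_odds + sum(
        present[j] if value else absent[j] for j, value in enumerate(row)
    )


def auc(scores, labels):
    """Area under the ROC curve as the chance a positive outranks a negative (ties count half)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
X, y = simulate(2000, rng)

# Rows are generated independently, so the last 20% acts as a random hold-out set.
cut = int(0.8 * len(X))
model = fit_log_likelihood_ratios(X[:cut], y[:cut])
validation_scores = [score(row, model) for row in X[cut:]]
validation_auc = auc(validation_scores, y[cut:])
print(f"Hold-out AROC on synthetic data: {validation_auc:.3f}")
```

The printed value reflects only the synthetic data and has no bearing on the AROC figures reported above. At the scale described in the abstract, a pairwise AUC computation like this one would be too slow, and a rank-based implementation would be the usual choice.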