Can diverse population characteristics be leveraged in a machine learning pipeline to predict resource intensive healthcare utilization among hospital service areas?

Author information

Department of Epidemiology, Geisel School of Medicine at Dartmouth College, Hanover, NH, USA.

Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth College, Hanover, NH, USA.

Publication information

BMC Health Serv Res. 2022 Jun 30;22(1):847. doi: 10.1186/s12913-022-08154-4.

DOI:10.1186/s12913-022-08154-4

PMID:35773679

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9248096/

Abstract

BACKGROUND

Super-utilizers represent approximately 5% of the population in the United States (U.S.), yet they are responsible for over 50% of healthcare expenditures. Using characteristics of hospital service areas (HSAs) to predict utilization of resource-intensive healthcare (RIHC) may offer a novel and actionable tool for identifying super-utilizer segments in the population. Consumer expenditures may offer additional value in predicting RIHC beyond typical population characteristics alone.

METHODS

Cross-sectional data from 2017 were extracted from 5 unique sources. The outcome was RIHC and included emergency room (ER) visits, inpatient days, and hospital expenditures, all expressed as log per capita. Candidate predictors from 4 broad groups were used: demographics, adult and child health characteristics, community characteristics, and consumer expenditures. Candidate predictors were expressed as per capita or per capita percent and were aggregated from ZIP codes to HSAs using weighted means. Machine learning approaches (random forest, LASSO) selected important features from nearly 1,000 available candidate predictors, and these features were used to generate 4 distinct models: non-regularized regression, LASSO regression, random forest, and gradient boosting. For each outcome, candidate predictors from the best performing model were used as independent variables in a multiple linear regression model. The relative contribution of variables from each candidate predictor group to regression model fit was calculated.
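The aggregation and outcome transformation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the abstract does not say which weights were used, so ZIP-code population is assumed here, the natural logarithm is assumed for the "log per capita" outcomes, and the function names and sample rows are hypothetical.

```python
import math
from collections import defaultdict


def aggregate_to_hsa(zip_rows):
    """Weighted mean of a per capita predictor, grouped by HSA.

    Each row is (hsa_id, weight, value). Using ZIP-code population as the
    weight is an assumption; the abstract does not name the weights.
    """
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for hsa_id, weight, value in zip_rows:
        weighted_sum[hsa_id] += weight * value
        total_weight[hsa_id] += weight
    # An HSA whose weights sum to zero would raise ZeroDivisionError here.
    return {hsa: weighted_sum[hsa] / total_weight[hsa] for hsa in weighted_sum}


def log_per_capita(total, population):
    """Natural log of a per capita outcome (the log base is an assumption)."""
    return math.log(total / population)


# Hypothetical ZIP-level rows: (HSA id, population, per capita value)
rows = [("HSA_A", 1000, 0.10), ("HSA_A", 3000, 0.30), ("HSA_B", 500, 0.20)]
hsa_means = aggregate_to_hsa(rows)
```

In this toy input, the larger ZIP code pulls the HSA_A mean toward its own value, which is the point of weighting by population rather than averaging ZIP codes equally.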

RESULTS

The median ER visits per capita was 0.482 [IQR: 0.351-0.646], the median inpatient days per capita was 0.395 [IQR: 0.214-0.806], and the median hospital expenditures per capita was $2,302 [IQR: $1,544.70-$3,469.80]. Using 1,106 variables, the test-set coefficient of determination (R²) from the best performing models ranged from 0.184 to 0.782. The adjusted R² values from multiple linear regression models ranged from 0.311 to 0.8293. The relative contribution of consumer expenditures to model fit ranged from 23.4% to 33.6%.
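For readers less familiar with the fit statistics reported above, the following minimal sketch shows the standard definitions of the coefficient of determination and its adjusted form. The function names are hypothetical and the numbers in the usage lines are made up for illustration; they are not the study's data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared, penalizing the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Illustrative values only
perfect_fit = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
mean_only_fit = r_squared([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5])
adjusted = adjusted_r_squared(0.5, n_obs=11, n_predictors=2)
```

A model that always predicts the mean scores 0, a perfect model scores 1, and the adjusted form falls below the raw value whenever predictors are added without a matching gain in fit.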

DISCUSSION

Machine learning models predicted RIHC among HSAs using diverse population data, including novel consumer expenditures, and provide an innovative tool to predict population-based healthcare utilization and expenditures. Geographic variation in utilization and spending was identified.
