Center for Home Care Policy & Research, VNS Health, New York, New York, USA.
College of Nursing, The University of Iowa, Iowa City, Iowa, USA.
Int J Med Inform. 2024 Nov;191:105534. doi: 10.1016/j.ijmedinf.2024.105534. Epub 2024 Jun 30.
This study aims to evaluate fairness metrics of machine learning (ML) models that predict hospitalization and emergency department (ED) visits in heart failure patients receiving home healthcare. We analyze biases, assess performance disparities, and propose solutions to improve model performance across diverse subpopulations.
The study used a dataset of 12,189 home healthcare episodes collected between 2015 and 2017, comprising structured data (e.g., a standardized assessment tool) and unstructured data (i.e., clinical notes). ML risk prediction models, including Light Gradient Boosting Machine (LightGBM) and AutoGluon, were developed using demographic information, vital signs, comorbidities, service utilization data, and the area deprivation index (ADI) associated with each patient's home address. Fairness metrics, such as Equal Opportunity, Predictive Equality, Predictive Parity, and Statistical Parity, were calculated to evaluate model performance across subpopulations.
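As a minimal, hypothetical sketch of how these four group-fairness metrics can be computed from a model's binary predictions (the column names, grouping variable, and toy data are illustrative, not the study's actual schema):

```python
import pandas as pd

def fairness_metrics(df: pd.DataFrame, group_col: str,
                     y_true: str = "y_true", y_pred: str = "y_pred") -> pd.DataFrame:
    """Per-group rates underlying the four fairness metrics:
    Equal Opportunity   -> true positive rate,        P(pred=1 | y=1, group)
    Predictive Equality -> false positive rate,       P(pred=1 | y=0, group)
    Predictive Parity   -> positive predictive value, P(y=1 | pred=1, group)
    Statistical Parity  -> positive prediction rate,  P(pred=1 | group)
    """
    rows = []
    for g, sub in df.groupby(group_col):
        pos = sub[sub[y_true] == 1]      # truly positive cases
        neg = sub[sub[y_true] == 0]      # truly negative cases
        flagged = sub[sub[y_pred] == 1]  # cases the model flagged as positive
        rows.append({
            group_col: g,
            "equal_opportunity": pos[y_pred].mean() if len(pos) else float("nan"),
            "predictive_equality": neg[y_pred].mean() if len(neg) else float("nan"),
            "predictive_parity": flagged[y_true].mean() if len(flagged) else float("nan"),
            "statistical_parity": sub[y_pred].mean(),
        })
    return pd.DataFrame(rows)

# Toy usage (hypothetical data, not from the study):
df = pd.DataFrame({
    "subgroup": ["A", "A", "A", "B", "B", "B"],
    "y_true":   [1, 0, 1, 1, 0, 0],
    "y_pred":   [1, 0, 0, 1, 1, 0],
})
print(fairness_metrics(df, "subgroup"))
```

Fairness is then assessed by comparing these per-group rates across subpopulations; large between-group differences indicate a violation of the corresponding parity criterion.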
Our study revealed significant disparities in model performance across demographic subgroups. For example, the Hispanic, Male, High-ADI subgroup performed best on Equal Opportunity with a metric value of 0.825, 28% higher than the 0.644 scored by the lowest-performing Other, Female, Low-ADI subgroup. In Predictive Parity, the gap between the highest- and lowest-performing groups was 29%; in Statistical Parity, the gap reached 69%; and in Predictive Equality, the difference was 45%.
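The reported 28% figure is consistent with a gap computed relative to the lowest-performing group, as a one-line check using the two Equal Opportunity values quoted above shows:

```python
best, worst = 0.825, 0.644            # Equal Opportunity values from the text
relative_gap = (best - worst) / worst  # gap relative to the lowest group
print(f"{relative_gap:.0%}")           # -> 28%
```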
The findings highlight substantial differences in fairness metrics across diverse patient subpopulations in ML risk prediction models for heart failure patients receiving home healthcare services. Ongoing monitoring and improvement of fairness metrics are essential to mitigate biases.