Center for Computational Health, IBM Research, Cambridge, Massachusetts.
Center for Computational Health, IBM TJ Watson Research Center, Yorktown Heights, NY.
JAMA Netw Open. 2021 Apr 1;4(4):e213909. doi: 10.1001/jamanetworkopen.2021.3909.
IMPORTANCE: The lack of standard methods for reducing bias in clinical algorithms presents challenges both for producing reliable predictions and for addressing health disparities.
OBJECTIVE: To evaluate approaches for reducing bias in machine learning models using a real-world clinical scenario.
DESIGN, SETTING, AND PARTICIPANTS: Health data for this cohort study were obtained from the IBM MarketScan Medicaid Database. Eligibility criteria were as follows: (1) female individuals aged 12 to 55 years with a live birth record identified by delivery-related codes from January 1, 2014, through December 31, 2018; (2) greater than 80% enrollment through pregnancy to 60 days post partum; and (3) evidence of coverage for depression screening and mental health services. Statistical analysis was performed in 2020.
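For concreteness, a minimal sketch of how such eligibility criteria could be applied with pandas. The file name and column names (age, delivery_date, enrollment_fraction, mh_coverage) are hypothetical placeholders, not the actual IBM MarketScan Medicaid schema, which is claims based and requires code-level logic to identify live births.

```python
import pandas as pd

# Hypothetical column names for illustration only; the real MarketScan
# data require delivery-related diagnosis/procedure codes to flag live births.
df = pd.read_csv("medicaid_cohort.csv", parse_dates=["delivery_date"])

eligible_df = df[
    df["age"].between(12, 55)                                   # (1) age 12-55 years
    & df["delivery_date"].between("2014-01-01", "2018-12-31")   # (1) live birth in study window
    & (df["enrollment_fraction"] > 0.80)                        # (2) >80% enrollment coverage
    & df["mh_coverage"]                                         # (3) screening/MH service coverage
]
```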
EXPOSURES: Binarized race (Black individuals and White individuals).
MAIN OUTCOMES AND MEASURES: Machine learning models (logistic regression [LR], random forest, and extreme gradient boosting) were trained for 2 binary outcomes: postpartum depression (PPD) and postpartum mental health service utilization. Risk-adjusted generalized linear models were used for each outcome to assess potential disparity in the cohort associated with binarized race (Black or White). Methods for reducing bias, including reweighing, Prejudice Remover, and removing race from the models, were examined by analyzing changes in fairness metrics compared with the base models. Baseline characteristics of female individuals in the top predicted-risk decile were compared for systematic differences. Fairness was measured by disparate impact (DI; 1 indicates fairness) and equal opportunity difference (EOD; 0 indicates fairness).
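Both fairness metrics have simple closed forms: DI is the ratio of favorable-prediction rates between the unprivileged and privileged groups, and EOD is the difference in their true-positive rates. A minimal sketch of both definitions (array names and toy values are illustrative):

```python
import numpy as np

def disparate_impact(y_pred, group):
    """P(y_hat = 1 | unprivileged) / P(y_hat = 1 | privileged); 1 indicates fairness."""
    return y_pred[group == 0].mean() / y_pred[group == 1].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """TPR(unprivileged) - TPR(privileged); 0 indicates fairness."""
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return tpr(0) - tpr(1)

# Toy check: group 0 is unprivileged, group 1 is privileged.
y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(disparate_impact(y_pred, group))                      # (2/4) / (3/4) = 0.67
print(equal_opportunity_difference(y_true, y_pred, group))  # 2/3 - 2/2 = -0.33
```

Under these definitions, the base LR model's DI of 0.31 reported below means the unprivileged group received the favorable prediction at less than one-third the rate of the privileged group.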
RESULTS: Among 573 634 female individuals initially examined for this study, 314 903 were White (54.9%), 217 899 were Black (38.0%), and the mean (SD) age was 26.1 (5.5) years. The risk-adjusted odds ratio comparing White participants with Black participants was 2.06 (95% CI, 2.02-2.10) for clinically recognized PPD and 1.37 (95% CI, 1.33-1.40) for postpartum mental health service utilization. Taking the LR model for PPD prediction as an example, reweighing reduced bias as measured by improved DI and EOD metrics from 0.31 and -0.19 to 0.79 and 0.02, respectively. Removing race from the models had inferior performance for reducing bias compared with the other methods (PPD: DI = 0.61; EOD = -0.05; mental health service utilization: DI = 0.63; EOD = -0.04).
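The abstract does not specify the implementation used, but reweighing and Prejudice Remover are both available in IBM's open-source AI Fairness 360 (AIF360) toolkit. A minimal reweighing sketch with AIF360, assuming a numeric cohort DataFrame eligible_df with a binary race column and a binary ppd outcome column (both hypothetical names), might look like this:

```python
from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing import Reweighing
from aif360.metrics import BinaryLabelDatasetMetric

privileged = [{"race": 1}]
unprivileged = [{"race": 0}]

# Wrap the (hypothetical) cohort DataFrame in an AIF360 dataset.
dataset = BinaryLabelDataset(
    df=eligible_df,
    label_names=["ppd"],
    protected_attribute_names=["race"],
)

# Reweighing learns instance weights that make the protected attribute
# and the outcome label statistically independent in the training data.
rw = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
dataset_rw = rw.fit_transform(dataset)

# DI of the reweighted training data should move toward 1.
metric = BinaryLabelDatasetMetric(
    dataset_rw, unprivileged_groups=unprivileged, privileged_groups=privileged
)
print(metric.disparate_impact())
```

Because reweighing changes only the instance weights, the downstream classifier (eg, LR fitted with those weights as sample_weight) is otherwise unchanged, which makes it a comparatively nonintrusive preprocessing approach.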
CONCLUSIONS AND RELEVANCE: Clinical prediction models trained on potentially biased data may produce unfair outcomes on the basis of the chosen metrics. This study's results suggest that the effectiveness of bias reduction varied with the model, the outcome label, and the method applied. This approach to evaluating algorithmic bias can serve as an example for the growing number of researchers who wish to examine and address bias in their data and models.