Lett Elle, Shahbandegan Shakiba, Barak-Corren Yuval, Fine Andrew M, La Cava William G
Center for Anti-Racism and Community Health, University of Washington School of Public Health, Seattle.
Health Systems and Population Health, University of Washington School of Public Health, Seattle.
JAMA Netw Open. 2025 May 1;8(5):e2512947. doi: 10.1001/jamanetworkopen.2025.12947.
IMPORTANCE: Fair clinical prediction models are crucial for achieving equitable health outcomes. Intersectionality has been applied to develop algorithms that address discrimination among intersections of protected attributes (eg, Black women rather than Black persons or women considered separately), yet most fair algorithms default to marginal debiasing, which optimizes performance across simplified patient subgroups.
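To make the distinction concrete, the sketch below (illustrative Python, not the authors' code; the DataFrame and column names are assumptions) contrasts the two subgroup definitions: marginal debiasing constrains performance on each protected attribute separately, whereas intersectional debiasing constrains performance on every cross-product subgroup.

```python
# A minimal sketch contrasting how marginal and intersectional debiasing
# partition patients. `df`, "race_ethnicity", and "gender" are illustrative.
import pandas as pd

df = pd.DataFrame({
    "race_ethnicity": ["Black", "Black", "White", "White"],
    "gender": ["female", "male", "female", "male"],
})

# Marginal: one group per attribute value, so a Black woman is covered
# only indirectly, via the "Black" group and the "female" group.
marginal_groups = (
    [df["race_ethnicity"] == v for v in df["race_ethnicity"].unique()]
    + [df["gender"] == v for v in df["gender"].unique()]
)

# Intersectional: one group per cross-product subgroup (eg, Black women),
# so each intersection's performance is constrained directly.
intersectional_groups = [
    (df["race_ethnicity"] == r) & (df["gender"] == g)
    for r, g in df[["race_ethnicity", "gender"]]
        .drop_duplicates()
        .itertuples(index=False)
]
```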
OBJECTIVE: To assess the extent to which simplifying patient subgroups during training is associated with intersectional subgroup performance in emergency department (ED) admission models.
DESIGN, SETTING, AND PARTICIPANTS: This prognostic study of admission prediction models used retrospective data on ED visits from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database at Beth Israel Deaconess Medical Center (n = 160 016) from January 1, 2011, to December 31, 2019, and from Boston Children's Hospital (BCH; n = 22 222) from June 1 through August 13, 2019. Statistical analysis was conducted from January 2022 to August 2024.
MAIN OUTCOMES AND MEASURES: The primary outcome was admission to an inpatient service. Admission prediction accuracy among intersectional subgroups was measured under variations in model training that optimized for group-level performance. Under different fairness definitions (calibration, error rate balance) and modeling methods (linear, nonlinear), the overall and subgroup performance of marginal debiasing approaches was compared with that of intersectional debiasing approaches. Subgroups were defined by self-reported race and ethnicity and gender. Measures included area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve, subgroup calibration error, and false-negative rates.
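For readers unfamiliar with the subgroup measures, a minimal sketch follows, assuming common definitions: subgroup calibration error as a binned expected calibration error computed within each subgroup, and subgroup false-negative rate as the share of true admissions predicted as discharges. The study's exact binning and weighting choices may differ.

```python
# Hedged sketch of the two subgroup fairness measures, computed within one
# subgroup's labels (y_true), predicted probabilities (y_prob), and
# thresholded predictions (y_pred). All arrays are NumPy arrays.
import numpy as np

def subgroup_calibration_error(y_true, y_prob, n_bins=10):
    """Binned expected calibration error: weighted mean absolute gap between
    observed admission rate and mean predicted risk in each probability bin."""
    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    err, n = 0.0, len(y_true)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.sum() / n * abs(y_true[mask].mean() - y_prob[mask].mean())
    return err

def subgroup_fnr(y_true, y_pred):
    """Share of true admissions the model predicted as discharges."""
    admitted = y_true == 1
    return (y_pred[admitted] == 0).mean() if admitted.any() else np.nan
```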
RESULTS: The MIMIC-IV cohort included 160 016 visits (mean [SD] age, 53.0 [19.3] years; 57.4% female patients; 0.3% American Indian or Alaska Native patients, 3.7% Asian patients, 26.2% Black patients, 10.0% Hispanic or Latino patients, and 59.7% White patients; 29.5% admitted), and the BCH cohort included 22 222 visits (mean [SD] age, 8.2 [6.8] years; 52.1% male patients; 0.1% American Indian or Alaska Native patients, 4.0% Asian patients, 19.7% Black patients, 30.6% Hispanic or Latino patients, 0.2% Native Hawaiian or Pacific Islander patients, and 37.7% White patients; 16.3% admitted). Among MIMIC-IV groups, intersectional debiasing was associated with a reduction in subgroup calibration error from 0.083 to 0.065 (a 22.3% relative reduction), while marginal debiasing was associated with a reduction from 0.083 to 0.074 (11.3%; difference, 11.1%); among BCH groups, intersectional debiasing was associated with a reduction in subgroup calibration error from 0.111 to 0.080 (28.3%), while marginal debiasing was associated with a reduction from 0.111 to 0.086 (22.6%; difference, 5.7%). Among MIMIC-IV groups, intersectional debiasing was associated with a reduction in subgroup false-negative rates from 0.142 to 0.125 (11.9%), while marginal debiasing was associated with a reduction from 0.142 to 0.132 (6.8%; difference, 5.1%). Fairness improvements did not decrease overall accuracy compared with baseline models (eg, MIMIC-IV: mean [SD] AUROC, 0.85 [0.00] for both models). Intersectional debiasing was also associated with lower error rates in several intersectional subpopulations compared with other strategies.
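The parenthetical percentages above are relative reductions, (baseline − debiased) / baseline. Recomputing from the rounded abstract values gives slightly different figures (eg, 21.7% vs the reported 22.3%), presumably because the published percentages were computed from unrounded estimates. A quick check:

```python
# Recompute the relative reductions from the rounded values in the abstract.
for name, base, debiased in [
    ("MIMIC-IV calibration, intersectional", 0.083, 0.065),  # reported 22.3%
    ("MIMIC-IV calibration, marginal",       0.083, 0.074),  # reported 11.3%
    ("BCH calibration, intersectional",      0.111, 0.080),  # reported 28.3%
    ("MIMIC-IV FNR, intersectional",         0.142, 0.125),  # reported 11.9%
]:
    print(f"{name}: {(base - debiased) / base:.1%}")
```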
CONCLUSIONS AND RELEVANCE: This study suggests that intersectional debiasing mitigates performance disparities across intersecting groups better than marginal debiasing for admission prediction. Intersectionally debiased models were associated with reduced group-specific errors without compromising overall accuracy. Developers of clinical risk prediction models should consider incorporating intersectional debiasing into model development.