Brown Katherine E, Wrenn Jesse O, Jackson Nicholas J, Cauley Michael R, Collins Benjamin, Novak Laurie Lovett, Malin Bradley A, Ancker Jessica S
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee.
Department of Emergency Medicine, Vanderbilt University Medical Center, Nashville, Tennessee.
medRxiv. 2025 Jun 24:2025.06.24.25330212. doi: 10.1101/2025.06.24.25330212.
Healthcare decisions are increasingly made with the assistance of machine learning (ML). ML is known to exhibit unfairness, that is, inconsistent outcomes across subpopulations. Clinicians interacting with these systems can perpetuate such unfairness through overreliance. Recent work on ML suppression, in which predictions are silenced based on an audit of the ML, shows promise in mitigating performance issues originating from overreliance. This study aims to evaluate the impact of suppression on the fairness of human-AI collaboration and to evaluate ML uncertainty as a criterion for auditing the ML.
We used data from the Vanderbilt University Medical Center electronic health record (n = 58,817) and the MIMIC-IV-ED dataset (n = 363,145) to predict the likelihood of death or intensive care unit (ICU) transfer and the likelihood of 30-day readmission. Our simulation study used gradient-boosted trees as well as an artificially high-performing oracle model. We derived clinician decisions directly from the datasets and simulated clinician acceptance of ML predictions based on previous empirical work on acceptance of clinical decision support (CDS) alerts. We measured performance as the area under the receiver operating characteristic curve (AUROC) and algorithmic fairness as the absolute averaged odds difference.
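The fairness metric above can be illustrated with a short sketch. This is a minimal, hypothetical implementation of one common formulation of the absolute averaged odds difference: the mean of the absolute gaps in true-positive rate and false-positive rate between two subpopulations (the function name and the binary group encoding are illustrative assumptions, not the authors' code).

```python
import numpy as np

def absolute_averaged_odds_difference(y_true, y_pred, group):
    """Absolute averaged odds difference between two subpopulations.

    Averages the absolute gaps in true-positive rate (TPR) and
    false-positive rate (FPR) between group == 0 and group == 1.
    A value of 0 indicates equalized odds; larger values indicate
    greater unfairness.
    """
    def rates(mask):
        yt, yp = y_true[mask], y_pred[mask]
        tpr = yp[yt == 1].mean() if np.any(yt == 1) else 0.0
        fpr = yp[yt == 0].mean() if np.any(yt == 0) else 0.0
        return tpr, fpr

    tpr0, fpr0 = rates(group == 0)
    tpr1, fpr1 = rates(group == 1)
    return 0.5 * (abs(tpr1 - tpr0) + abs(fpr1 - fpr0))
```

For example, identical error rates across the two groups yield 0, while a classifier that is perfectly sensitive for one group but not the other scores higher.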
When the ML outperforms humans, suppression outperforms the human alone (p < 0.034) and at least does not degrade fairness. When the human outperforms the ML, suppression outperforms the human (p < 5.2 × 10) but the human is fairer than suppression (p < 0.0019). Finally, incorporating uncertainty quantification into suppression approaches can improve performance.
Suppression of poor-quality ML predictions through an auditor model shows promise in improving collaborative human-AI performance and fairness.
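The suppression mechanism described above can be sketched as follows. This is a simplified, hypothetical illustration assuming an uncertainty-thresholded auditor and a fixed alert-acceptance probability; the threshold, acceptance probability, and function names are illustrative assumptions, not the study's actual simulation code.

```python
import numpy as np

def suppressed_decision(ml_prob, ml_uncertainty, clinician_decision,
                        uncertainty_threshold=0.2, accept_prob=0.8,
                        rng=None):
    """Human-AI collaboration with uncertainty-based suppression.

    If the ML's uncertainty exceeds the threshold, the auditor
    suppresses (silences) the prediction and the clinician's own
    decision stands. Otherwise the clinician accepts the ML
    prediction with probability `accept_prob`, simulating
    empirically observed acceptance of CDS alerts.
    """
    rng = rng or np.random.default_rng(0)
    if ml_uncertainty > uncertainty_threshold:
        return clinician_decision          # auditor silences the ML
    ml_decision = int(ml_prob >= 0.5)
    if rng.random() < accept_prob:
        return ml_decision                 # clinician accepts the alert
    return clinician_decision              # clinician overrides
```

When the auditor suppresses an unreliable prediction, overreliance is impossible by construction, which is the intuition behind the performance and fairness results reported above.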