Department of Psychiatry & Behavioral Sciences, Rush University Medical Center, Chicago, Illinois, USA.
Department of Computer Science, Loyola University, Chicago, Illinois, USA.
J Am Med Inform Assoc. 2021 Oct 12;28(11):2393-2403. doi: 10.1093/jamia/ocab148.
OBJECTIVE: To assess fairness and bias of a previously validated machine learning opioid misuse classifier.
MATERIALS & METHODS: Two experiments were conducted with the classifier's original (n = 1000) and external validation (n = 53 974) datasets from 2 health systems. Bias was assessed via testing for differences in type II error rates across racial/ethnic subgroups (Black, Hispanic/Latinx, White, Other) using bootstrapped 95% confidence intervals. A local surrogate model was estimated to interpret the classifier's predictions by race and averaged globally from the datasets. Subgroup analyses and post-hoc recalibrations were conducted to attempt to mitigate biased metrics.
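A minimal sketch of the subgroup error-rate comparison described above, assuming held-out predictions are available in a table with hypothetical columns race, y_true, and y_pred; the paper's actual variable names and bootstrap settings are not specified here.

```python
import numpy as np
import pandas as pd

def fnr(y_true, y_pred):
    """False negative rate: share of true positives the classifier misses."""
    positives = y_true == 1
    if positives.sum() == 0:
        return np.nan
    return np.mean(y_pred[positives] == 0)

def bootstrap_fnr_gap(df, group_a, group_b, n_boot=2000, seed=0):
    """Bootstrapped 95% CI for FNR(group_a) - FNR(group_b)."""
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_boot):
        # Resample rows with replacement, then recompute the subgroup FNR gap.
        idx = rng.integers(0, len(df), size=len(df))
        sample = df.iloc[idx]
        a = sample[sample["race"] == group_a]
        b = sample[sample["race"] == group_b]
        gaps.append(fnr(a["y_true"].values, a["y_pred"].values)
                    - fnr(b["y_true"].values, b["y_pred"].values))
    lo, hi = np.nanpercentile(gaps, [2.5, 97.5])
    return lo, hi

# Example (hypothetical DataFrame preds_df): a CI excluding 0 would suggest
# a subgroup difference in the type II error rate.
# lo, hi = bootstrap_fnr_gap(preds_df, "Black", "White")
```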
RESULTS: We identified bias in the false negative rate (FNR = 0.32) of the Black subgroup compared to the FNR (0.17) of the White subgroup. Top features included "heroin" and "substance abuse" across subgroups. Post-hoc recalibrations eliminated bias in FNR with minimal changes in other subgroup error metrics. The Black FNR subgroup had higher risk scores for readmission and mortality than the White FNR subgroup, and a higher mortality risk score than the Black true positive subgroup (P < .05).
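A hedged sketch of one post-hoc recalibration strategy consistent with the abstract: choosing group-specific decision thresholds so that each subgroup's FNR matches a reference group's. The column names (race, y_true, y_score), the 0.5 baseline cutoff, and scores in [0, 1] are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np
import pandas as pd

def threshold_for_target_fnr(y_true, y_score, target_fnr):
    """Largest threshold whose FNR does not exceed target_fnr (scores in [0, 1])."""
    pos_scores = np.sort(y_score[y_true == 1])
    # FNR at threshold t is the fraction of positive-class scores below t.
    candidates = np.unique(np.concatenate(([0.0], pos_scores, [1.0])))
    feasible = [t for t in candidates if np.mean(pos_scores < t) <= target_fnr]
    return max(feasible) if feasible else candidates.min()

def recalibrate_by_group(df, reference_group="White", ref_threshold=0.5):
    """Per-group thresholds matching each subgroup's FNR to the reference group's."""
    ref = df[df["race"] == reference_group]
    ref_fnr = np.mean(ref.loc[ref["y_true"] == 1, "y_score"] < ref_threshold)
    return {g: threshold_for_target_fnr(sub["y_true"].values,
                                        sub["y_score"].values, ref_fnr)
            for g, sub in df.groupby("race")}

# Example (hypothetical DataFrame preds_df with predicted probabilities):
# thresholds = recalibrate_by_group(preds_df)
# preds_df["y_pred_recal"] = preds_df["y_score"] >= preds_df["race"].map(thresholds)
```

Equalizing FNR this way trades some type I error in the recalibrated groups for lower type II error, which mirrors the abstract's finding that the recalibration produced only minimal changes in the other subgroup error metrics.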
DISCUSSION: The Black FNR subgroup had the greatest severity of disease and risk for poor outcomes. Similar features predicted opioid misuse across subgroups, but inequities were present. Post-hoc techniques mitigated bias in the type II error rate without creating substantial type I error. From model design through deployment, bias and data disadvantages should be systematically addressed.
CONCLUSIONS: Standardized, transparent bias assessments are needed to improve trustworthiness in clinical machine learning models.