Barton Michael, Hamza Mahmoud, Guevel Borna
Medicine, Harvard Medical School, Boston, USA.
Quantitative Methods, Harvard School of Public Health, Boston, USA.
Cureus. 2023 Feb 15;15(2):e35037. doi: 10.7759/cureus.35037. eCollection 2023 Feb.
Background and objective: While the potential of machine learning (ML) in healthcare to positively impact human health continues to grow, the potential for inequity in these methods must also be assessed. In this study, we aimed to evaluate the presence of racial bias when five of the most common ML algorithms are used to build models with minimal preprocessing to reduce racial bias.

Methods: Using a CDC public database, we constructed models to predict healthcare access (a binary variable). With area under the curve (AUC) as our performance metric, we computed race-specific performance comparisons for each ML algorithm. We bootstrapped the entire analysis 20 times to produce confidence intervals for the AUC estimates.

Results: With the exception of a few cases, performance for the White group was, in general, significantly higher than that for the other racial groups across all ML algorithms. Additionally, the most accurate algorithm in our modeling was Extreme Gradient Boosting (XGBoost), followed by random forest, naive Bayes, support vector machine (SVM), and k-nearest neighbors (KNN).

Conclusion: Our study illustrates the predictive perils of incorporating only minimal racial bias mitigation into ML models, which results in predictive disparities by race. This is particularly concerning given the evidence of limited bias mitigation in healthcare-related ML. More conversation, research, and guidelines are needed on methods for assessing and mitigating racial bias in healthcare-related ML models, both those currently in use and those in development.
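The race-stratified evaluation described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' code: synthetic data stands in for the CDC database, the two group labels and five features are hypothetical placeholders, and only a few of the five classifiers are shown. The key steps match the abstract: fit a model, score AUC separately within each racial group on held-out data, and bootstrap the whole analysis 20 times for confidence intervals.

```python
# Hedged sketch of a race-stratified AUC comparison with bootstrapped CIs.
# Synthetic data and group labels "A"/"B" are illustrative assumptions,
# not the paper's actual CDC variables.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                       # placeholder features
group = rng.choice(["A", "B"], size=n)            # stands in for race categories
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

# A subset of the five algorithms compared in the study.
models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
}

def group_aucs(X, y, group, model):
    """Fit on a train split; return AUC computed within each group on the test split."""
    idx = np.arange(len(y))
    tr, te = train_test_split(idx, test_size=0.3, random_state=0, stratify=y)
    model.fit(X[tr], y[tr])
    p = model.predict_proba(X[te])[:, 1]
    return {g: roc_auc_score(y[te][group[te] == g], p[group[te] == g])
            for g in np.unique(group)}

def bootstrap_cis(X, y, group, model, n_boot=20):
    """Resample rows, refit, and rescore n_boot times (20 in the abstract);
    return a percentile CI for the within-group AUC of each group."""
    per_group = {g: [] for g in np.unique(group)}
    for _ in range(n_boot):
        samp = rng.integers(0, len(y), size=len(y))
        for g, a in group_aucs(X[samp], y[samp], group[samp], model).items():
            per_group[g].append(a)
    return {g: (np.percentile(v, 2.5), np.percentile(v, 97.5))
            for g, v in per_group.items()}
```

A gap between the per-group AUCs (or non-overlapping CIs) is the kind of race-specific performance disparity the study reports; bootstrapping the entire pipeline, rather than only the final scores, propagates fitting variability into the intervals.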