Department of Biomedical Informatics, School of Medicine, Emory University, United States of America.
J Electrocardiol. 2022 Sep-Oct;74:5-9. doi: 10.1016/j.jelectrocard.2022.07.007. Epub 2022 Jul 18.
Despite the recent explosion of machine learning applied to medical data, very few studies have examined algorithmic bias in any meaningful manner, comparing across algorithms, databases, and assessment metrics. In this study, we compared the sex, age, and race biases of 56 algorithms on over 130,000 electrocardiograms (ECGs) using several metrics and propose a machine learning model design to reduce bias. Participants in the 2021 PhysioNet Challenge designed and implemented working, open-source algorithms to identify clinical diagnoses from 2-lead ECG recordings. We grouped the data from the training, validation, and test datasets by sex (male vs. female), age (binned by decade), and race (Asian, Black, White, and Other) whenever possible. We computed the recording-wise accuracy, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), F-measure, and Challenge Score for each of the 56 algorithms. The Mann-Whitney U and Kruskal-Wallis tests assessed the performance differences of the algorithms across these demographic groups. Group trends revealed similar values for the AUROC, AUPRC, and F-measure for the male and female groups across the training, validation, and test sets. However, recording-wise accuracies were 20% higher (p < 0.01) and the Challenge Score 12% lower (p = 0.02) for female subjects on the test set. AUPRC, F-measure, and the Challenge Score increased with age, while recording-wise accuracy and AUROC decreased with age. The results were similar for the training and test sets, but only recording-wise accuracy (12% decrease per decade, p < 0.01), Challenge Score (1% increase per decade, p < 0.01), and AUROC (1% decrease per decade, p < 0.01) were statistically different on the test set. We observed similar AUROC, AUPRC, Challenge Score, and F-measure values across the different race categories. However, recording-wise accuracies were significantly lower for Black subjects and higher for Asian subjects on the training (31% difference, p < 0.01) and test (39% difference, p < 0.01) sets. A top-performing model was then retrained using an additional constraint that simultaneously minimized differences in performance across sex, race, and age. This resulted in a modest reduction in performance and a significant reduction in bias. This work demonstrates that biases manifest as a function of model architecture, population, cost function, and optimization metric, all of which should be closely examined in any model.
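The group-wise comparison described in the abstract can be illustrated with a short sketch. The code below is a hypothetical Python example, not the Challenge's actual evaluation code; the column names, toy data, and the choice of AUROC and recording-wise accuracy are illustrative assumptions. It computes a metric separately for each demographic group and applies the Mann-Whitney U test when there are two groups (e.g., sex) or the Kruskal-Wallis test when there are more (e.g., race or age decade).

```python
# Hypothetical sketch of per-group evaluation and significance testing.
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu, kruskal
from sklearn.metrics import roc_auc_score

def per_group_auroc(df: pd.DataFrame, group_col: str) -> dict:
    """Compute AUROC separately for each demographic group."""
    scores = {}
    for group, rows in df.groupby(group_col):
        # Skip groups containing only one class; AUROC is undefined there.
        if rows["label"].nunique() < 2:
            continue
        scores[group] = roc_auc_score(rows["label"], rows["prediction"])
    return scores

def compare_groups(per_recording_metric: pd.Series, groups: pd.Series) -> float:
    """Test whether a recording-wise metric differs across demographic groups."""
    samples = [per_recording_metric[groups == g].values for g in groups.unique()]
    if len(samples) == 2:
        _, p = mannwhitneyu(*samples, alternative="two-sided")  # e.g., male vs. female
    else:
        _, p = kruskal(*samples)  # e.g., race categories or age decades
    return p

# Toy usage with random recording-level outputs.
df = pd.DataFrame({
    "label": np.random.randint(0, 2, 1000),
    "prediction": np.random.rand(1000),
    "sex": np.random.choice(["male", "female"], 1000),
})
print(per_group_auroc(df, "sex"))
correct = (df["label"] == (df["prediction"] > 0.5)).astype(float)  # recording-wise accuracy indicator
print(compare_groups(correct, df["sex"]))
```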
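The bias-reduction step can likewise be sketched. The snippet below is a minimal illustration, assuming a PyTorch classifier and a variance-style penalty on per-group losses; it is not the authors' implementation, and the penalty form, the `lam` weight, and the function name are assumptions. It shows one way an additional training constraint can discourage performance differences across demographic groups; in practice the penalty would be computed for each attribute (sex, race, age bin) and summed so that all three are constrained simultaneously.

```python
# Assumed sketch of a bias-penalized loss, not the authors' method.
import torch
import torch.nn.functional as F

def fairness_penalized_loss(logits: torch.Tensor,
                            labels: torch.Tensor,
                            group_ids: torch.Tensor,
                            lam: float = 0.5) -> torch.Tensor:
    """Binary cross-entropy plus a penalty on across-group loss differences."""
    # Per-recording task loss (labels expected as floats in {0, 1}).
    per_sample = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    task_loss = per_sample.mean()

    # Mean loss within each demographic group present in the batch.
    group_losses = torch.stack([per_sample[group_ids == g].mean()
                                for g in torch.unique(group_ids)])

    # Penalize the spread of group-wise losses: zero when all groups are
    # fit equally well, large when one group lags behind.
    bias_penalty = group_losses.var(unbiased=False)
    return task_loss + lam * bias_penalty
```

The weight `lam` trades overall performance against fairness, which is consistent with the modest performance drop and significant bias reduction reported above.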