Department of Respiratory Medicine, Medway NHS Foundation Trust, Gillingham, Kent, UK; Department of Critical Care, Medway NHS Foundation Trust, Gillingham, Kent, UK; Faculty of Life Sciences, King's College London, London, UK.
UCL Institute for Health Informatics, London, UK; Crystallise, Essex, UK.
Lancet Digit Health. 2021 Apr;3(4):e241-e249. doi: 10.1016/S2589-7500(21)00022-4.
Despite wide use of severity scoring systems for case-mix determination and benchmarking in the intensive care unit (ICU), the possibility of scoring bias across ethnicities has not been examined. Guidelines on the use of illness severity scores to inform triage decisions for allocation of scarce resources, such as mechanical ventilation, during the current COVID-19 pandemic warrant examination for possible bias in these models. We investigated the performance of the severity scoring systems Acute Physiology and Chronic Health Evaluation IVa (APACHE IVa), Oxford Acute Severity of Illness Score (OASIS), and Sequential Organ Failure Assessment (SOFA) across four ethnicities in two large ICU databases to identify possible ethnicity-based bias.
Data from the electronic ICU Collaborative Research Database (eICU-CRD) and the Medical Information Mart for Intensive Care III (MIMIC-III) database, built from patient episodes in the USA from 2014-15 and 2001-12, respectively, were analysed for score performance in Asian, Black, Hispanic, and White people after appropriate exclusions. Hospital mortality was the outcome of interest. Discrimination and calibration were determined for all three scoring systems in all four groups: discrimination was assessed with the area under the receiver operating characteristic curve (AUROC) for each ethnic group, and calibration with the standardised mortality ratio (SMR) or proxy measures.
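As an illustration of the two measures named above, the following is a minimal sketch, not the authors' analysis code. It assumes each patient record carries a group label, a predicted probability of hospital death, and the observed outcome (1 = died, 0 = survived). The function names, the record layout, and the toy data are all hypothetical. AUROC is computed as the probability that a randomly chosen death received a higher prediction than a randomly chosen survivor; SMR is observed deaths divided by expected deaths, so a value below 1 means the score overpredicts mortality in that group.

```python
from collections import defaultdict


def auroc(scores, outcomes):
    """Rank-based AUROC: share of (death, survivor) pairs in which the
    death has the higher score; tied pairs count as one half."""
    deaths = [s for s, y in zip(scores, outcomes) if y == 1]
    survivors = [s for s, y in zip(scores, outcomes) if y == 0]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for d in deaths:
        for s in survivors:
            if d > s:
                wins += 1.0
            elif d == s:
                wins += 0.5
    return wins / (len(deaths) * len(survivors))


def smr(predicted_probs, outcomes):
    """Standardised mortality ratio: observed deaths / expected deaths,
    where expected deaths is the sum of predicted probabilities."""
    expected = sum(predicted_probs)
    if expected == 0:
        raise ValueError("expected deaths is zero")
    return sum(outcomes) / expected


def metrics_by_group(records):
    """records: iterable of (group, predicted_prob, died) tuples
    (hypothetical layout). Returns {group: {"auroc": ..., "smr": ...}}."""
    grouped = defaultdict(list)
    for group, prob, died in records:
        grouped[group].append((prob, died))
    result = {}
    for group, rows in grouped.items():
        probs = [p for p, _ in rows]
        died = [d for _, d in rows]
        result[group] = {"auroc": auroc(probs, died), "smr": smr(probs, died)}
    return result


# Toy data, invented purely for illustration (not from either database).
toy_records = [
    ("A", 0.9, 1), ("A", 0.2, 0), ("A", 0.6, 0), ("A", 0.3, 1),
    ("B", 0.8, 1), ("B", 0.4, 0), ("B", 0.5, 0), ("B", 0.3, 0),
]

if __name__ == "__main__":
    for group, m in sorted(metrics_by_group(toy_records).items()):
        print(group, m)
```

In the toy data, group B ranks its single death above every survivor, so its discrimination is perfect, yet it has about two expected deaths against one observed. This is the situation the two measures are meant to separate: a score can discriminate well within a group while still being poorly calibrated for it.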
We analysed 166 751 participants (122 919 eICU-CRD and 43 832 MIMIC-III). Although measurements of discrimination were significantly different among the groups (AUROC ranging from 0·86 to 0·89 [p=0·016] with APACHE IVa and from 0·75 to 0·77 [p=0·85] with OASIS), they did not display any discernible systematic patterns of bias. However, measurements of calibration indicated persistent, and in some cases statistically significant, patterns of difference between Hispanic people (SMR 0·73 with APACHE IVa and 0·64 with OASIS) and Black people (0·67 and 0·68) versus Asian people (0·77 and 0·95) and White people (0·76 and 0·81). Although calibrations were imperfect for all groups, the scores consistently showed a pattern of overpredicting mortality for Black people and Hispanic people. Similar results were seen using SOFA scores across the two databases.
The systematic differences in calibration across ethnicities suggest that illness severity scores reflect statistical bias in their predictions of mortality.
There was no specific funding for this study.