Hochheiser Harry, Klug Jesse, Mathie Thomas, Pollard Tom J, Raffa Jesse D, Ballard Stephanie L, Conrad Evamarie A, Edakalavan Smitha, Joseph Allan, Alnomasy Nader, Nutman Sarah, Hill Veronika, Kapoor Sumit, Claudio Eddie Pérez, Kravchenko Olga V, Li Ruoting, Nourelahi Mehdi, Diaz Jenny, Taylor W Michael, Rooney Sydney R, Woeltje Maeve, Celi Leo Anthony, Horvat Christopher M
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America.
UPMC Intensive Care Unit Service Center, UPMC, Pittsburgh, Pennsylvania, United States of America.
PLOS Digit Health. 2025 Jul 11;4(7):e0000932. doi: 10.1371/journal.pdig.0000932. eCollection 2025 Jul.
To challenge clinicians and informaticians to learn about potential sources of bias in medical machine learning models through investigation of data and predictions from an open-source severity of illness score.
Over a two-day period (total elapsed time approximately 28 hours), we conducted a datathon that challenged interdisciplinary teams to investigate potential sources of bias in the Global Open Source Severity of Illness Score (GOSSIS-1). Teams were invited to develop hypotheses, use tools of their choosing to identify potential sources of bias, and provide a final report.
Five teams participated, three of which included both informaticians and clinicians. Most teams (4/5) used Python for their analyses; the remaining team used R. Common analysis themes included the relationship of the GOSSIS-1 prediction score with demographics and care-related variables; relationships between demographics and outcomes; calibration and factors related to the context of care; and the impact of missingness. Representativeness of the population, differences in calibration and model performance among groups, and differences in performance across hospital settings were identified as possible sources of bias.
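To make the group-wise calibration and performance comparisons concrete, the following minimal Python sketch illustrates one way such an analysis could be done; it is not taken from any team's report. The dataframe, input file, and column names (gossis1_prob for the predicted probability of in-hospital mortality, hospital_death for the 0/1 outcome, ethnicity for the demographic group) are assumptions for illustration only.

    # Hypothetical sketch of a group-wise calibration and discrimination check
    # for GOSSIS-1 predictions; column and file names are illustrative.
    import pandas as pd
    from sklearn.metrics import roc_auc_score, brier_score_loss

    df = pd.read_csv("gossis1_predictions.csv")  # assumed input file

    rows = []
    complete = df.dropna(subset=["gossis1_prob", "hospital_death"])
    for group, sub in complete.groupby("ethnicity"):
        if sub["hospital_death"].nunique() < 2:
            continue  # AUROC is undefined with only one outcome class
        rows.append({
            "group": group,
            "n": len(sub),
            "observed_mortality": sub["hospital_death"].mean(),
            "mean_predicted": sub["gossis1_prob"].mean(),
            "auroc": roc_auc_score(sub["hospital_death"], sub["gossis1_prob"]),
            "brier": brier_score_loss(sub["hospital_death"], sub["gossis1_prob"]),
        })

    summary = pd.DataFrame(rows).sort_values("n", ascending=False)
    print(summary.to_string(index=False))

Comparing observed against mean predicted mortality within each group gives a crude calibration check, while group-specific AUROC and Brier scores surface the kinds of performance differences the teams flagged as possible sources of bias.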
Datathons are a promising approach for challenging developers and users to explore questions relating to unrecognized biases in medical machine learning algorithms.