Davis Sharon E, Dorn Chad, Park Daniel J, Matheny Michael E
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.
Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, United States.
J Am Med Inform Assoc. 2025 May 1;32(5):845-854. doi: 10.1093/jamia/ocaf039.
While performance drift of clinical prediction models is well-documented, the potential for algorithmic biases to emerge post-deployment remains poorly characterized. A better understanding of how model performance may shift over time across subpopulations is required to incorporate fairness drift into model maintenance strategies.
We explore fairness drift in a national population over 11 years, with and without model maintenance aimed at sustaining population-level performance. We trained random forest models predicting 30-day post-surgical readmission, mortality, and pneumonia using 2013 data from US Department of Veterans Affairs facilities. We evaluated performance quarterly from 2014 to 2023 by self-reported race and sex. We estimated discrimination, calibration, and accuracy, and operationalized fairness using metric parity measured as the gap between disadvantaged and advantaged groups.
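To make the metric-parity notion concrete, the sketch below shows one way a fairness gap could be computed, assuming Python with scikit-learn and using AUROC as the illustrative performance metric; the function name, data, and group labels are hypothetical and not drawn from the study's code.

```python
# Minimal sketch (not the authors' implementation): a metric-parity
# fairness gap, i.e., the difference in a performance metric between
# an advantaged and a disadvantaged subgroup at one evaluation period.
import numpy as np
from sklearn.metrics import roc_auc_score

def fairness_gap(y_true, y_score, group, disadvantaged, advantaged):
    """Return metric parity as advantaged AUROC minus disadvantaged AUROC.

    A positive gap indicates worse discrimination for the
    disadvantaged group under this sign convention.
    """
    mask_d = group == disadvantaged
    mask_a = group == advantaged
    auc_d = roc_auc_score(y_true[mask_d], y_score[mask_d])
    auc_a = roc_auc_score(y_true[mask_a], y_score[mask_a])
    return auc_a - auc_d

# Hypothetical data standing in for one quarterly evaluation window.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)                              # observed outcomes
scores = np.clip(y * 0.3 + rng.normal(0.4, 0.2, 1000), 0, 1)  # model risk scores
groups = rng.choice(["A", "B"], 1000)                     # subgroup labels
print(fairness_gap(y, scores, groups, disadvantaged="B", advantaged="A"))
```

In the study's design, a gap like this would be tracked quarterly for each metric (discrimination, calibration, accuracy) and each subgroup pairing, so that drift in the gap itself, not just in population-level performance, can be observed.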
Our cohort included 1 739 666 surgical cases. We observed fairness drift in both the original and temporally updated models. Model updating had a larger impact on overall performance than on fairness gaps. During periods of stable fairness, updating models at the population level increased, decreased, or did not affect fairness gaps. During periods of fairness drift, updating models restored fairness in some cases and exacerbated fairness gaps in others.
This exploratory study highlights that algorithmic fairness cannot be assured through one-time assessments during model development. Temporal changes in fairness may take multiple forms and interact with model updating strategies in unanticipated ways.
Equitable and sustainable clinical artificial intelligence deployments will require novel methods to monitor algorithmic fairness, detect emerging bias, and adopt model updates that promote fairness.