Lu Jonathan, Sattler Amelia, Wang Samantha, Khaki Ali Raza, Callahan Alison, Fleming Scott, Fong Rebecca, Ehlert Benjamin, Li Ron C, Shieh Lisa, Ramchandran Kavitha, Gensheimer Michael F, Chobot Sarah, Pfohl Stephen, Li Siyun, Shum Kenny, Parikh Nitin, Desai Priya, Seevaratnam Briththa, Hanson Melanie, Smith Margaret, Xu Yizhe, Gokhale Arjun, Lin Steven, Pfeffer Michael A, Teuteberg Winifred, Shah Nigam H
Center for Biomedical Informatics Research, Department of Medicine, Stanford University School of Medicine, Palo Alto, United States.
Stanford Healthcare AI Applied Research Team, Division of Primary Care and Population Health, Department of Medicine, Stanford University School of Medicine, Palo Alto, United States.
Front Digit Health. 2022 Sep 12;4:943768. doi: 10.3389/fdgth.2022.943768. eCollection 2022.
Multiple reporting guidelines for artificial intelligence (AI) models in healthcare recommend that models be audited for reliability and fairness. However, operational guidance for performing such audits in practice is lacking. Following guideline recommendations, we conducted a reliability audit of two models based on model performance and calibration, as well as a fairness audit based on summary statistics, subgroup performance, and subgroup calibration. We assessed the Epic End-of-Life (EOL) Index model and an internally developed Stanford Hospital Medicine (HM) Advance Care Planning (ACP) model in three practice settings: Primary Care, Inpatient Oncology, and Hospital Medicine, using clinicians' answers to the surprise question ("Would you be surprised if [patient X] passed away in [Y years]?") as a surrogate outcome. For performance, the models had positive predictive value (PPV) at or above 0.76 in all settings. In Hospital Medicine and Inpatient Oncology, the Stanford HM ACP model had higher sensitivity (0.69 and 0.89, respectively) than the EOL model (0.20 and 0.27), and better calibration (O/E 1.5 and 1.7) than the EOL model (O/E 2.5 and 3.0). The Epic EOL model flagged fewer patients (11% and 21%, respectively) than the Stanford HM ACP model (38% and 75%). There were no differences in performance or calibration by sex. Both models had lower sensitivity in Hispanic/Latino male patients with race listed as "Other." Ten clinicians were surveyed after a presentation summarizing the audit. 10/10 reported that summary statistics, overall performance, and subgroup performance would affect their decision to use the model to guide care; 9/10 said the same for overall and subgroup calibration. The most commonly identified barriers to routinely conducting such reliability and fairness audits were poor demographic data quality and lack of data access. This audit required 115 person-hours across 8-10 months.
Our recommendations for performing reliability and fairness audits include verifying data validity, analyzing model performance on intersectional subgroups, and collecting the clinician-patient linkages needed for clinicians to generate labels. Those responsible for AI models should require such audits before model deployment and mediate between model auditors and impacted stakeholders.
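The audit metrics named above (PPV, sensitivity, and the observed/expected calibration ratio, computed overall and on intersectional subgroups) can be sketched as follows. This is a minimal illustration, not the authors' code; the record layout, field names, and the `subgroup` helper are hypothetical.

```python
# Sketch of the reliability/fairness audit metrics: PPV, sensitivity, and
# observed/expected (O/E) calibration ratio, overall and on an intersectional
# subgroup slice. The data layout here is an assumption for illustration.
from dataclasses import dataclass

@dataclass
class Record:
    risk: float     # model-predicted probability of the outcome
    flagged: bool   # model flag (risk exceeds the deployment threshold)
    outcome: bool   # surrogate label, e.g. clinician's surprise-question answer
    sex: str
    ethnicity: str
    race: str

def ppv(records):
    # Positive predictive value: fraction of flagged patients with the outcome.
    flagged = [r for r in records if r.flagged]
    return sum(r.outcome for r in flagged) / len(flagged)

def sensitivity(records):
    # Fraction of patients with the outcome whom the model flagged.
    positives = [r for r in records if r.outcome]
    return sum(r.flagged for r in positives) / len(positives)

def oe_ratio(records):
    # Calibration: observed event count over the sum of predicted risks
    # (O/E > 1 means the model under-predicts risk on this group).
    observed = sum(r.outcome for r in records)
    expected = sum(r.risk for r in records)
    return observed / expected

def subgroup(records, **attrs):
    # Intersectional slice, e.g.
    # subgroup(data, sex="M", ethnicity="Hispanic/Latino", race="Other")
    return [r for r in records if all(getattr(r, k) == v for k, v in attrs.items())]
```

Running each metric on every subgroup slice, not just the overall cohort, is what surfaces gaps like the lower sensitivity reported for Hispanic/Latino male patients with race listed as "Other."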