Corbin Conor K, Baiocchi Michael, Chen Jonathan H
Department of Biomedical Data Science, Stanford, California, USA.
Center for Biomedical Informatics Research, Stanford, California, USA.
AMIA Jt Summits Transl Sci Proc. 2023 Jun 16;2023:81-90. eCollection 2023.
When evaluating the performance of clinical machine learning models, one must consider the deployment population. When the population of patients with observed labels is only a subset of the deployment population (label selection), standard model performance estimates on the observed population may be misleading. In this study we describe three classes of label selection and simulate five causally distinct scenarios to assess how particular selection mechanisms bias a suite of commonly reported binary machine learning model performance metrics. Simulations reveal that when selection is affected by observed features, naive estimates of model discrimination may be misleading. When selection is affected by labels, naive estimates of calibration fail to reflect reality. We borrow traditional weighting estimators from the causal inference literature and find that when selection probabilities are properly specified, they recover full-population estimates. We then tackle the real-world task of monitoring the performance of deployed machine learning models whose interactions with clinicians feed back into and affect the selection mechanism of the labels. We train three machine learning models to flag low-yield laboratory diagnostics, and simulate their intended consequence of reducing wasteful laboratory utilization. We find that naive estimates of AUROC on the observed population underestimate actual performance by up to 20%. Such a disparity could be large enough to lead to the wrongful termination of a successful clinical decision support tool. We propose an altered deployment procedure, one that combines injected randomization with traditional weighted estimates, and find it recovers true model performance.
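The abstract describes recovering full-population performance estimates with weighting estimators when the selection probabilities are correctly specified. The sketch below is a minimal illustration, not the authors' code: it assumes label selection depends only on an observed feature and that the selection probabilities are known, and it compares a naive AUROC on the observed subset against an inverse-probability-weighted AUROC. All variable names and the toy data-generating process are hypothetical.

```python
# Minimal sketch: inverse-probability-weighted AUROC under label selection.
# Assumes P(selected | x) is known; in practice it would be estimated.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Full deployment population with one feature x and a binary label y.
n = 50_000
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(1.5 * x - 0.5))))

# Scores from a previously trained classifier (toy logistic form here).
scores = 1 / (1 + np.exp(-(1.2 * x - 0.4)))

# Feature-dependent label selection: labels observed more often for high x,
# which biases naive discrimination estimates on the observed subset.
p_select = 1 / (1 + np.exp(-(2.0 * x)))
selected = rng.binomial(1, p_select).astype(bool)

auroc_full = roc_auc_score(y, scores)                       # ground truth
auroc_naive = roc_auc_score(y[selected], scores[selected])  # naive estimate

# IPW correction: weight each observed example by 1 / P(selected | x).
weights = 1.0 / p_select[selected]
auroc_ipw = roc_auc_score(y[selected], scores[selected], sample_weight=weights)

print(f"full-population AUROC: {auroc_full:.3f}")
print(f"naive observed AUROC:  {auroc_naive:.3f}")
print(f"IPW-corrected AUROC:   {auroc_ipw:.3f}")
```

The same weighting idea extends to the deployment setting described in the abstract: if a small random fraction of model-flagged orders is still carried out (injected randomization), the known randomization probabilities supply the selection probabilities needed for the weighted estimates.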