Kim Grace Y E, Corbin Conor K, Grolleau François, Baiocchi Michael, Chen Jonathan H
Center for Biomedical Informatics Research, Stanford, CA, USA.
Center for Biomedical Informatics Research, Stanford, CA, USA; Department of Biomedical Data Science, Stanford, CA, USA.
J Biomed Inform. 2025 Aug;168:104854. doi: 10.1016/j.jbi.2025.104854. Epub 2025 Jun 5.
As machine learning adoption in clinical practice continues to grow, deployed classifiers must be continuously monitored and updated (retrained) to protect against data drift stemming from inevitable changes, including evolving medical practices and shifting patient populations. However, a successful clinical machine learning classifier will itself change care, which may shift the distribution of features, labels, and the relationship between them. For example, "high risk" cases that were correctly identified by the model may ultimately be labeled "low risk" thanks to an intervention prompted by the model's alert. Classifier surveillance systems naive to such deployment-induced feedback loops will underestimate model performance and degrade future classifier retraining. The objective of this study is to simulate the impact of these feedback loops, propose feedback-aware monitoring strategies as a solution, and assess the performance of these alternative monitoring strategies through simulation.
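To make the feedback loop concrete, the following is a minimal toy simulation (our illustration, not code from the paper; all variable names and parameter values are assumptions) in which an effective alert-triggered intervention flips some would-be-positive labels to negative, so that naive monitoring against observed labels underestimates discrimination relative to the no-treatment potential outcome:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20_000

# Latent risk and the no-treatment potential outcome Y(0).
risk = rng.uniform(size=n)
y0 = rng.binomial(1, risk)                       # label absent any intervention

# Deployed model alerts on high predicted risk; scores correlate with risk.
score = np.clip(risk + rng.normal(0, 0.15, n), 0, 1)
alert = score > 0.7

# Clinicians adhere to the alert with some probability; an effective
# intervention averts a fraction of would-be events among adherent cases.
adherent = alert & (rng.uniform(size=n) < 0.8)
averted = adherent & (rng.uniform(size=n) < 0.6)
y_obs = np.where(averted, 0, y0)                 # observed post-deployment label

print("AUROC vs Y(0):       ", round(roc_auc_score(y0, score), 3))
print("Naive AUROC vs Y_obs:", round(roc_auc_score(y_obs, score), 3))

Because the averted events are concentrated among high-scoring patients, the observed-label AUROC comes out lower even though the model's true discrimination is unchanged.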
We propose Adherence Weighted and Sampling Weighted Monitoring as two feedback-loop-aware surveillance strategies. Through simulation, we evaluate their ability to accurately appraise post-deployment model performance and to initiate safe and accurate classifier retraining.
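As a rough sketch of how an adherence-weighted estimate might work (this is our reading of the idea, not the authors' implementation; the helper name, the untreated-patient selection rule, and the assumption of a known adherence probability are all ours), one can score the model only on patients whose observed label still equals the no-treatment potential outcome and reweight them by the inverse probability of remaining untreated:

import numpy as np
from sklearn.metrics import roc_auc_score

def adherence_weighted_auroc(score, y_obs, alert, adherent, p_adhere):
    """AUROC targeting the no-treatment potential outcome Y(0).

    Keep patients whose observed label is untouched by the intervention
    (no alert fired, or an alert fired but the clinician did not act),
    and reweight alerted-but-untreated patients by 1 / (1 - p_adhere),
    the inverse probability of remaining untreated given an alert.
    All arguments are NumPy arrays of equal length.
    """
    untreated = ~alert | (alert & ~adherent)
    w = np.ones(len(score))
    w[alert] = 1.0 / (1.0 - p_adhere[alert])
    return roc_auc_score(y_obs[untreated], score[untreated],
                         sample_weight=w[untreated])

On the toy simulation above, where the true adherence probability given an alert is 0.8, adherence_weighted_auroc(score, y_obs, alert, adherent, np.full(n, 0.8)) lands close to the AUROC computed against y0, while the naive estimate does not; in practice p_adhere would itself have to be estimated from deployment logs.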
Measured across accuracy, area under the receiver operating characteristic curve (AUROC), average precision, Brier score, expected calibration error, F1, precision, sensitivity, and specificity, the Adherence Weighted and Sampling Weighted strategies show the highest fidelity to ground-truth classifier performance in the presence of feedback loops, while standard approaches yield the least accurate estimates. Furthermore, in simulations with true data drift, retraining with standard unweighted approaches drops AUROC from 0.72 to 0.52. In contrast, retraining based on the Adherence Weighted and Sampling Weighted strategies recovers performance to 0.67, comparable to what a new model trained from scratch on the existing and shifted data would obtain.
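For reference, most of the metrics listed above accept per-patient weights directly in scikit-learn, so the same weighting scheme extends beyond AUROC. The sketch below is a hypothetical convenience wrapper of ours (expected calibration error has no scikit-learn helper and is omitted):

from sklearn.metrics import (accuracy_score, roc_auc_score,
                             average_precision_score, brier_score_loss,
                             f1_score, precision_score, recall_score)

def weighted_metrics(y, score, w, threshold=0.5):
    """Weighted performance metrics; y is a binary NumPy integer array."""
    yhat = (score >= threshold).astype(int)
    return {
        "accuracy":    accuracy_score(y, yhat, sample_weight=w),
        "auroc":       roc_auc_score(y, score, sample_weight=w),
        "avg_prec":    average_precision_score(y, score, sample_weight=w),
        "brier":       brier_score_loss(y, score, sample_weight=w),
        "f1":          f1_score(y, yhat, sample_weight=w),
        "precision":   precision_score(y, yhat, sample_weight=w),
        "sensitivity": recall_score(y, yhat, sample_weight=w),
        # Specificity is the recall of the negative class.
        "specificity": recall_score(1 - y, 1 - yhat, sample_weight=w),
    }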
Compared to standard approaches, the Adherence Weighted and Sampling Weighted strategies yield more accurate classifier performance estimates, measured against the no-treatment potential outcome. Retraining based on these strategies brings stronger performance recovery under data drift and feedback loops than standard approaches do.