Koch Lisa M, Baumgartner Christian F, Berens Philipp
Hertie Institute for AI in Brain Health, University of Tübingen, Tübingen, Germany.
Cluster of Excellence Machine Learning: New Perspectives for Science, University of Tübingen, Tübingen, Germany.
NPJ Digit Med. 2024 May 9;7(1):120. doi: 10.1038/s41746-024-01085-w.
Distribution shifts remain a problem for the safe application of regulated medical AI systems and, if undetected, may impact their real-world performance. Postmarket shifts can occur, for example, if algorithms developed on data from various acquisition settings and a heterogeneous population are predominantly applied in hospitals with lower-quality data acquisition or other centre-specific acquisition factors, or where some ethnicities are over-represented. Distribution shift detection could therefore be important for monitoring AI-based medical products during postmarket surveillance. We implemented and evaluated three deep-learning-based shift detection techniques (classifier-based tests, deep kernel tests, and multiple univariate Kolmogorov-Smirnov tests) on simulated shifts in a dataset of 130,486 retinal images. We trained a deep learning classifier for diabetic retinopathy grading. We then simulated population shifts by changing the prevalence of patients' sex, ethnicity, and co-morbidities, and example acquisition shifts by changing image quality. We observed disparities in classification performance between subgroups defined by image quality, patient sex, ethnicity, and presence of co-morbidities. The sensitivity for detecting referable diabetic retinopathy ranged from 0.50 to 0.79 across ethnicities. This motivates the need for detecting shifts after deployment. Classifier-based tests performed best overall, with perfect detection rates for quality and co-morbidity subgroup shifts at a sample size of 1000. They were the only method to detect shifts in patient sex, but required large sample sizes ( ). All methods identified easier-to-detect out-of-distribution shifts with small (≤300) sample sizes. We conclude that effective tools exist for detecting clinically relevant distribution shifts. In particular, classifier-based tests can easily be implemented as components of the postmarket surveillance strategy of medical device manufacturers.
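To illustrate the idea behind a classifier-based shift test, the following is a minimal sketch, not the authors' implementation. It assumes that each image has already been reduced to a low-dimensional feature vector, and it swaps in a plain logistic regression for the deep domain classifier used in the study. The classifier is trained to tell reference (source) samples from deployment (target) samples; if its held-out accuracy is significantly above chance according to a one-sided binomial test, a shift is flagged. All function names, the toy Gaussian data, and the hyperparameters are hypothetical, and equal source and target sample sizes are assumed so that chance accuracy is 0.5.

```python
import math
import random


def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p); adequate for a few hundred trials."""
    total = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
    return min(1.0, total)  # guard against floating-point overshoot


def train_logistic(X, y, epochs=300, lr=0.5):
    """Fit logistic regression by full-batch gradient descent (illustrative only)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # avoid overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def classifier_two_sample_test(source, target, alpha=0.05, seed=0):
    """Return (p_value, shift_detected) for equally sized source/target samples."""
    rng = random.Random(seed)
    data = [(x, 0) for x in source] + [(x, 1) for x in target]
    rng.shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]
    w, b = train_logistic([x for x, _ in train], [y for _, y in train])
    # Count held-out samples whose domain (source vs. target) is predicted correctly.
    correct = sum(
        (b + sum(wj * xj for wj, xj in zip(w, x)) > 0) == (y == 1) for x, y in test
    )
    # Under the null hypothesis of no shift, accuracy is at chance level (0.5).
    p_value = binomial_tail(correct, len(test), 0.5)
    return p_value, p_value < alpha


def gaussian_sample(n, mean, rng, dim=2):
    """Hypothetical stand-in for image features: isotropic Gaussian vectors."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)
    source = gaussian_sample(200, 0.0, rng)
    target = gaussian_sample(200, 1.0, rng)  # simulated mean shift
    print(classifier_two_sample_test(source, target))
```

In a monitoring setting, the same test would be rerun on each new batch of deployment data against a fixed reference sample. Subtle shifts yield accuracies only slightly above 0.5, which is consistent with the abstract's observation that some shifts require large sample sizes to detect.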