Oakden-Rayner Luke, Dunnmon Jared, Carneiro Gustavo, Ré Christopher
Australian Institute for Machine Learning, University of Adelaide, Adelaide, Australia.
Department of Computer Science, Stanford University, Stanford, California, USA.
Proc ACM Conf Health Inference Learn (2020). 2020 Apr;2020:151-159. doi: 10.1145/3368555.3384468.
Machine learning models for medical image analysis often suffer from poor performance on important subsets of a population that are not identified during training or testing. For example, overall performance of a cancer detection model may be high, but the model may still consistently miss a rare but aggressive cancer subtype. We refer to this problem as hidden stratification, and observe that it results from incompletely describing the meaningful variation in a dataset. While hidden stratification can substantially reduce the clinical efficacy of machine learning models, its effects remain difficult to measure. In this work, we assess the utility of several possible techniques for measuring hidden stratification effects, and characterize these effects via synthetic experiments on the CIFAR-100 benchmark dataset and via experiments on multiple real-world medical imaging datasets. Using these measurement techniques, we find evidence that hidden stratification can occur in unidentified imaging subsets with low prevalence, low label quality, subtle distinguishing features, or spurious correlates, and that it can result in relative performance differences of over 20% on clinically important subsets. Finally, we discuss the clinical implications of our findings, and suggest that evaluation of hidden stratification should be a critical component of any machine learning deployment in medical imaging.
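To make the idea of a relative performance difference on a subset concrete, the following is a minimal sketch and not the authors' measurement procedure. It assumes that subset labels are already available for each positive case, and it uses sensitivity (recall) as the metric; the function names `recall_by_subset` and `relative_gap` and the toy data are hypothetical.

```python
from collections import defaultdict


def recall_by_subset(y_true, y_pred, subset):
    """Return (overall_recall, {subset_name: recall}), using positive cases only."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, s in zip(y_true, y_pred, subset):
        if t == 1:  # negatives do not contribute to recall
            totals[s] += 1
            hits[s] += int(p == 1)
    n_pos = sum(totals.values())
    if n_pos == 0:
        raise ValueError("no positive cases, recall is undefined")
    overall = sum(hits.values()) / n_pos
    per_subset = {s: hits[s] / totals[s] for s in totals}
    return overall, per_subset


def relative_gap(overall, per_subset):
    """Relative drop of each subset's recall versus the overall recall."""
    if overall == 0:
        raise ValueError("overall recall is zero, relative gap is undefined")
    return {s: (overall - r) / overall for s, r in per_subset.items()}


# Toy data (hypothetical): 8 "common" positives, 2 "rare" positives, 2 negatives.
y_true = [1] * 10 + [0, 0]
y_pred = [1] * 8 + [1, 0] + [0, 1]
subset = ["common"] * 8 + ["rare"] * 2 + ["none"] * 2

overall, per_subset = recall_by_subset(y_true, y_pred, subset)
gaps = relative_gap(overall, per_subset)
print(overall, per_subset, gaps)
```

In this toy case the overall recall looks strong while the rare subset is detected only half the time, which is the pattern the abstract describes. In practice the subset labels are exactly what is missing, so a sketch like this only applies once subsets have been identified by some other means.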