Mohammad Azizmalayeri, Ameen Abu-Hanna, Giovanni Cinà
Department of Medical Informatics, Amsterdam Public Health Research Institute, Amsterdam UMC, University of Amsterdam, the Netherlands.
Department of Medical Informatics, Amsterdam Public Health Research Institute, Amsterdam UMC, University of Amsterdam, the Netherlands; Institute of Logic, Language and Computation, University of Amsterdam, the Netherlands; Pacmed, Amsterdam, the Netherlands.
Int J Med Inform. 2025 Mar;195:105762. doi: 10.1016/j.ijmedinf.2024.105762. Epub 2024 Dec 17.
Machine Learning (ML) models often struggle to generalize to data that deviates from the training distribution. This raises significant concerns about the reliability of real-world healthcare systems when they encounter such inputs, known as out-of-distribution (OOD) data. These concerns can be addressed by real-time detection of OOD inputs. While numerous OOD detection approaches have been proposed in other fields, especially computer vision, it remains unclear whether similar methods effectively address the challenges posed by medical tabular data.
To answer this important question, we propose an extensive, reproducible benchmark that compares different OOD detection methods on medical tabular data across a comprehensive suite of tests.
To achieve this, we leverage four large public medical datasets, including eICU and MIMIC-IV, and consider various kinds of OOD cases within these datasets. For example, we examine OOD data originating from a dataset that is statistically different from the training set, as quantified by the membership model introduced by Debray et al. [1], as well as OOD data obtained by splitting a given dataset on the value of a distinguishing variable. To identify OOD instances, we explore a range of 10 density-based methods that learn the marginal distribution of the data, alongside 17 post-hoc detectors that are applied on top of prediction models already trained on the data. The prediction models use three distinct architectures, namely MLP, ResNet, and Transformer.
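The membership-model idea of Debray et al. can be illustrated with a minimal sketch: train a classifier to distinguish samples of the training dataset from samples of a candidate dataset, and read its AUC as a measure of how separable (i.e., how out-of-distribution) the candidate is. The synthetic tables, the logistic-regression choice, and all parameters below are illustrative assumptions, not the paper's exact setup.

```python
# Membership-model sketch: a classifier separating "training" rows from
# "candidate" rows; its AUC quantifies distributional separability.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 10
X_train_set = rng.normal(0.0, 1.0, size=(n, d))  # stand-in for the training table
X_candidate = rng.normal(1.5, 1.0, size=(n, d))  # stand-in for a shifted candidate table

# Label 0 = training set, 1 = candidate set; fit the membership classifier.
X = np.vstack([X_train_set, X_candidate])
y = np.concatenate([np.zeros(n), np.ones(n)])
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

membership = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
auc = roc_auc_score(y_eval, membership.predict_proba(X_eval)[:, 1])
print(f"membership AUC: {auc:.2f}")  # near 1.0 => clearly separable datasets
```

An AUC near 0.5 would instead indicate that the candidate dataset is statistically indistinguishable from the training data, foreshadowing the harder detection settings in the results.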
In our experiments, when the membership model achieved an AUC of 0.98, indicating a clear distinction between the OOD data and the training set, the OOD detection methods achieved AUC values exceeding 0.95 in distinguishing OOD data. In contrast, in experiments with subtler changes in data distribution, such as selecting OOD data based on ethnicity or age, many OOD detection methods performed similarly to a random classifier, with AUC values close to 0.5. This may suggest a correlation between separability, as indicated by the membership model, and OOD detection performance, as indicated by the AUC of the detection model. This warrants future research.
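The detection-AUC evaluation described above can be sketched with one common post-hoc detector, Maximum Softmax Probability (MSP): score each sample by the trained classifier's top predicted probability, then compute the AUC of separating OOD from in-distribution samples. The synthetic data, the small MLP, and the placement of the OOD cluster are illustrative assumptions rather than the paper's actual configuration.

```python
# Post-hoc OOD detection sketch: MSP scores from an already-trained
# classifier, evaluated with AUC (OOD = positive class).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n, d = 1000, 10
# In-distribution: two well-separated classes; OOD: points near the
# decision boundary that the classifier has never seen.
X_id = np.vstack([rng.normal(-1, 1, (n, d)), rng.normal(1, 1, (n, d))])
y_id = np.concatenate([np.zeros(n), np.ones(n)])
X_ood = rng.normal(0, 0.3, (n, d))

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_id, y_id)

def msp(model, X):
    """Maximum softmax probability: higher = more confidently in-distribution."""
    return model.predict_proba(X).max(axis=1)

# Held-out ID samples vs OOD samples; negate MSP so that low confidence
# (the OOD signal) maps to a high detection score.
X_id_test = np.vstack([rng.normal(-1, 1, (200, d)), rng.normal(1, 1, (200, d))])
scores = np.concatenate([-msp(clf, X_id_test), -msp(clf, X_ood)])
labels = np.concatenate([np.zeros(len(X_id_test)), np.ones(n)])
auc = roc_auc_score(labels, scores)
print(f"detection AUC: {auc:.2f}")
```

Note that this toy setup places OOD data where the classifier is uncertain; in the subtler real-world shifts reported above (e.g., ethnicity- or age-based splits), confidence-based detectors can remain overconfident and their AUC collapses toward 0.5.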