Kwon Hyunjin, Greenberg Matthew, Josephson Colin Bruce, Lee Joon
Department of Biomedical Engineering, Schulich School of Engineering, University of Calgary, Calgary, Alberta, Canada.
Data Intelligence for Health Lab, Cumming School of Medicine, University of Calgary, Alberta, Canada.
Sci Rep. 2024 May 7;14(1):10474. doi: 10.1038/s41598-024-61284-z.
Variation in prediction difficulty across individual cases is one of the key factors that researchers encounter when applying machine learning to data. Although previous studies have introduced various metrics for assessing the prediction difficulty of individual cases, these metrics require specific dataset preconditions. In this paper, we propose three novel metrics for measuring the prediction difficulty of individual cases using fully-connected feedforward neural networks. The first metric is based on the complexity of the neural network needed to make a correct prediction. The second metric employs a pair of neural networks: one makes a prediction for a given case, and the other predicts whether the prediction made by the first model is likely to be correct. The third metric assesses the variability of the neural network's predictions. We investigated these metrics using a variety of datasets, visualized their values, and compared them to fifteen existing metrics from the literature. The results demonstrate that the proposed case difficulty metrics differentiated levels of difficulty better than most of the existing metrics and were consistently effective across diverse datasets. We expect our metrics will provide researchers with a new perspective on understanding their datasets and applying machine learning in various fields.
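The three ideas in the abstract can be illustrated with a minimal sketch. It trains no networks and only post-processes predictions that are assumed to exist already. The function names, the input formats, and the scoring formulas (normalised index of the first correct network, one minus the estimated probability of correctness, and population standard deviation) are illustrative assumptions, not the definitions used in the paper.

```python
from statistics import pstdev


def complexity_difficulty(correct_by_size):
    """Sketch of the first idea: how complex a network must be to get the case right.

    correct_by_size: list of bools, one per network, ordered from the
    smallest network to the largest (assumed input format).
    Returns the position of the first correct network scaled to [0, 1),
    or 1.0 if no network in the list predicts the case correctly.
    """
    n = len(correct_by_size)
    for i, is_correct in enumerate(correct_by_size):
        if is_correct:
            return i / n
    return 1.0


def pair_difficulty(p_first_model_correct):
    """Sketch of the second idea: a second network judges the first.

    p_first_model_correct: the second network's estimated probability
    that the first network's prediction for this case is correct.
    """
    return 1.0 - p_first_model_correct


def variability_difficulty(true_class_probs):
    """Sketch of the third idea: variability of the network's predictions.

    true_class_probs: probabilities assigned to the true class by
    repeated trainings of the same architecture (assumed input format).
    Note: spread alone does not flag a case that is consistently wrong.
    """
    return pstdev(true_class_probs)


if __name__ == "__main__":
    # Hypothetical case: the two smallest networks fail, the third succeeds.
    print(complexity_difficulty([False, False, True, True]))
    # Hypothetical case: the second network is 90% sure the first is right.
    print(pair_difficulty(0.9))
    # Hypothetical case: predictions swing widely between training runs.
    print(variability_difficulty([0.1, 0.9]))
```

Under these assumptions every score grows with difficulty, so the three values can be compared side by side for a single case. The scales are not calibrated against each other.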