Jafrasteh Bahram, Adeli Ehsan, Pohl Kilian M, Kuceyeski Amy, Sabuncu Mert R, Zhao Qingyu
Department of Radiology, Weill Cornell Medicine, New York, NY, USA.
Department of Psychiatry & Behavioral Sciences, Stanford University, Stanford, CA, USA.
Sci Rep. 2025 Aug 6;15(1):28745. doi: 10.1038/s41598-025-12026-2.
Machine learning (ML) has significantly transformed biomedical research, leading to growing interest in model development to advance classification accuracy in various clinical applications. However, this progress raises essential questions about how to rigorously compare the accuracy of different ML models. In this study, we highlight the practical challenges in quantifying the statistical significance of accuracy differences between two neuroimaging-based classification models when cross-validation (CV) is performed. Specifically, we propose an unbiased framework to assess the impact of CV setups (e.g., the number of folds) on statistical significance. We apply this framework to three publicly available neuroimaging datasets to re-emphasize known flaws in the current computation of p-values for comparing model accuracies. We further demonstrate that the likelihood of detecting significant differences among models varies substantially with the intrinsic properties of the data, the testing procedures, and the chosen CV configuration. Given that many of these factors do not typically fall within the evaluation criteria of ML-based biomedical studies, we argue that such variability can lead to p-hacking and inconsistent conclusions about model improvement. The results of this study underscore that more rigorous practices in model comparison are urgently needed to mitigate the reproducibility crisis in biomedical ML research.
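To make the critiqued comparison concrete, the sketch below (using only the Python standard library, with hypothetical per-fold accuracies) computes the paired t statistic over fold-wise accuracy differences of two models evaluated on the same CV folds. This is a commonly used procedure of the kind the study examines, not the authors' proposed framework; as the abstract notes, CV folds share training data, so the fold-wise differences are not independent and the resulting p-value can be optimistic.

```python
import math
import statistics

def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic over per-fold accuracy differences.

    Note: CV folds overlap in their training sets, so the
    independence assumption behind this test is violated --
    one of the known flaws in naive model comparison.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    k = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample std. dev. (ddof=1)
    return mean_diff / (sd_diff / math.sqrt(k))

# Hypothetical per-fold accuracies for two models on the same 5 folds.
acc_model_a = [0.82, 0.79, 0.85, 0.80, 0.83]
acc_model_b = [0.80, 0.78, 0.82, 0.79, 0.81]

t_stat = paired_t_statistic(acc_model_a, acc_model_b)
print(round(t_stat, 3))
```

A p-value would then be read from a t distribution with k-1 degrees of freedom; because the fold-wise differences are correlated, that p-value understates the true uncertainty, which is why the choice of CV configuration can sway the apparent significance.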