Gallitto Giuseppe, Englert Robert, Kincses Balint, Kotikalapudi Raviteja, Li Jialin, Hoffschlag Kevin, Bingel Ulrike, Spisak Tamas
Center for Translational Neuro- and Behavioral Sciences (C-TNBS), University Medicine Essen, Hufelandstraße 55, 45147, Essen, Germany.
Department of Neurology, University Medicine Essen, Hufelandstraße 55, 45147, Essen, Germany.
GigaScience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf036.
Multivariate predictive models play a crucial role in enhancing our understanding of complex biological systems and in developing innovative, replicable tools for translational medical research. However, the complexity of machine learning methods and extensive data preprocessing and feature engineering pipelines can lead to overfitting and poor generalizability. An unbiased evaluation of predictive models necessitates external validation, which involves testing the finalized model on independent data. Despite its importance, external validation is often neglected in practice due to the associated costs.
Here we propose that, for maximal credibility, model discovery and external validation should be separated by the public disclosure (e.g., preregistration) of feature processing steps and model weights. Furthermore, we introduce a novel approach to optimize the trade-off between efforts spent on model discovery and external validation in such studies. We show on data involving more than 3,000 participants from four different datasets that, for any "sample size budget," the proposed adaptive splitting approach can successfully identify the optimal time to stop model discovery so that predictive performance is maximized without risking a low-powered, and thus inconclusive, external validation.
The proposed design and splitting approach (implemented in the Python package "AdaptiveSplit") may contribute to addressing issues of replicability, effect size inflation, and generalizability in predictive modeling studies.
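The core idea of the adaptive split can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy, not the AdaptiveSplit implementation: it assumes a saturating learning curve supplied by the caller, uses a Fisher-z power approximation for a correlation-type validation test, and grows the discovery set batch by batch until either the expected performance gain becomes negligible or the samples left over would no longer yield a sufficiently powered external validation. The function and parameter names (`min_validation_n`, `adaptive_split`, `batch`) are hypothetical.

```python
import math

def min_validation_n(r_expected, z_alpha=1.645, z_beta=0.842):
    """Smallest validation sample that detects a correlation of r_expected
    (one-sided test, alpha=0.05, power=0.80 by default) via the Fisher-z
    approximation: n >= ((z_alpha + z_beta) / atanh(r)) ** 2 + 3."""
    fz = math.atanh(r_expected)
    return math.ceil(((z_alpha + z_beta) / fz) ** 2 + 3)

def adaptive_split(total_n, learning_curve, batch=50):
    """Grow the discovery set in batches; stop model discovery as soon as
    another batch would leave an underpowered validation set, or the
    predicted performance gain from that batch is negligible."""
    n_discovery = batch
    while n_discovery + batch <= total_n:
        r_now = learning_curve(n_discovery)
        r_next = learning_curve(n_discovery + batch)
        needed = min_validation_n(max(r_next, 1e-3))
        if (total_n - (n_discovery + batch) < needed
                or r_next - r_now < 1e-3):
            break
        n_discovery += batch
    return n_discovery, total_n - n_discovery

# Toy saturating learning curve: r(n) = 0.4 * n / (n + 200)
curve = lambda n: 0.4 * n / (n + 200)
disc, val = adaptive_split(1000, curve)
```

In this toy setting the stopping rule trades discovery samples for validation power exactly as the abstract describes: the split point moves as far as it can without the remaining validation set dropping below the minimum sample size implied by the model's predicted effect size.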