Austin Peter C, Lee Douglas S, Wang Bo
ICES, V106, 2075 Bayview Avenue, Toronto, ON, M4N 3M5, Canada.
Department of Health Policy, Management and Evaluation, University of Toronto, Toronto, ON, Canada.
Diagn Progn Res. 2024 Nov 5;8(1):15. doi: 10.1186/s41512-024-00179-z.
Machine learning methods are increasingly being used to predict clinical outcomes. Optimism is the difference in model performance between derivation and validation samples. The term "data hungriness" refers to the sample size needed for a modelling technique to generate a prediction model with minimal optimism. Our objective was to compare the relative data hungriness of different statistical and machine learning methods when assessed using calibration.
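The notion of optimism can be illustrated with a small hypothetical simulation (not taken from the study): fit an over-parameterized model on a small derivation sample, then compare its apparent c-statistic on that sample with its c-statistic in a large independent validation sample. The data-generating process, sample sizes, and use of scikit-learn below are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def simulate(n, n_noise=40):
    # Two informative predictors plus many noise predictors,
    # so a small derivation sample invites overfitting.
    X = rng.normal(size=(n, 2 + n_noise))
    lp = 0.8 * X[:, 0] - 0.8 * X[:, 1]          # true linear predictor
    y = (rng.random(n) < 1 / (1 + np.exp(-lp))).astype(int)
    return X, y

X_dev, y_dev = simulate(150)     # small derivation sample
X_val, y_val = simulate(20000)   # large validation sample

# Essentially unpenalized logistic regression (very large C).
model = LogisticRegression(C=1e6, max_iter=1000).fit(X_dev, y_dev)

c_dev = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
c_val = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
optimism = c_dev - c_val  # apparent minus validated performance
```

With many noise predictors and few events, the apparent c-statistic exceeds the validated one, and `optimism` is positive; as the derivation sample grows, the gap shrinks.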
We used Monte Carlo simulations to assess the effect of the number of events per variable (EPV) on the optimism of six learning methods when assessing model calibration: unpenalized logistic regression, ridge regression, lasso regression, bagged classification trees, random forests, and stochastic gradient boosting machines using trees as the base learners. We performed simulations in two large cardiovascular datasets, each of which comprised an independent derivation and validation sample: patients hospitalized with acute myocardial infarction and patients hospitalized with heart failure. We used six data-generating processes, each based on one of the six learning methods. We varied the sample sizes so that the number of EPV ranged from 10 to 200 in increments of 10. We applied the six prediction methods in each of the simulated derivation samples and evaluated calibration in the simulated validation samples using the integrated calibration index, the calibration intercept, and the calibration slope. We also examined Nagelkerke's R², the scaled Brier score, and the c-statistic.
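The calibration intercept and slope mentioned above are conventionally estimated by regressing the observed binary outcome on the logit of the predicted probabilities. A minimal sketch (not the authors' code; the Newton-Raphson fitter and clipping threshold are assumptions for self-containment):

```python
import numpy as np

def fit_logistic(X, y, offset=None, iters=25):
    """Fit a logistic regression by Newton-Raphson; X is (n, k)."""
    n, k = X.shape
    beta = np.zeros(k)
    off = np.zeros(n) if offset is None else offset
    for _ in range(iters):
        eta = X @ beta + off
        p = 1 / (1 + np.exp(-eta))
        w = p * (1 - p)                      # IRLS weights
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta

def calibration_intercept_slope(y, p_hat):
    """Calibration intercept and slope from validation-sample predictions."""
    p = np.clip(p_hat, 1e-10, 1 - 1e-10)
    lp = np.log(p / (1 - p))                 # logit of predicted probabilities
    # Slope: coefficient of the logit when regressing y on it.
    X = np.column_stack([np.ones_like(lp), lp])
    slope = fit_logistic(X, y)[1]
    # Intercept: intercept-only model with the logit entering as an offset.
    intercept = fit_logistic(np.ones((len(y), 1)), y, offset=lp)[0]
    return intercept, slope
```

For a well-calibrated model the slope is close to 1 and the intercept close to 0; optimism appears as slopes below 1 (overfitted, too-extreme predictions) in the validation sample.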
Across all 12 scenarios (2 diseases × 6 data-generating processes), penalized logistic regression displayed very low optimism even when the number of EPV was very low. Random forests and bagged trees tended to be the most data hungry and displayed the greatest optimism.
When assessed using calibration, penalized logistic regression was substantially less data hungry than methods from the machine learning literature.