Kalaycıoğlu Oya, Pavlou Menelaos, Akhanlı Serhat E, de Belder Mark A, Ambler Gareth, Omar Rumana Z
Department of Biostatistics and Medical Informatics, Bolu Abant İzzet Baysal University, Bolu, Türkiye.
Department of Statistical Science, University College London, London, UK.
Stat Methods Med Res. 2025 Jul;34(7):1356-1372. doi: 10.1177/09622802251338983. Epub 2025 May 14.
Machine learning techniques (MLTs) are increasingly being used to develop clinical risk prediction models for binary health outcomes, but the sample size requirements for developing and validating such models remain unclear. This study investigates whether sample size guidelines that target mean absolute prediction error (MAPE) for logistic regression models can be applied to tree-based ensemble MLTs (bagging, random forests, and boosting). Simulations based on two large cardiovascular datasets were used to evaluate the performance of MLTs in terms of MAPE, calibration, the c-statistic and Brier score, across six data-generating mechanisms (DGMs) and varying sample sizes. When the DGM and analysis model matched, boosting required a sample size 2-3 times larger than recommended; random forests and bagging did not achieve the target MAPE even with a 12-fold increase. For a neutral DGM that did not match any of the analysis models, logistic regression with only main effects and boosting resulted in target MAPE values with a 12-fold increase in the recommended sample size. For external validation, our simulations showed that sample size guidelines to achieve a target precision of the estimated c-statistic were suitable, and thus may be used to inform sample size calculations for MLTs.
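As an illustration of three of the performance measures named above, the following minimal sketch computes MAPE, the Brier score and the c-statistic using their standard textbook definitions. It is not the authors' code: the function names and the toy data are hypothetical, and MAPE assumes the true outcome probabilities are known, which holds only in a simulation setting.

```python
def mape(p_true, p_hat):
    """Mean absolute prediction error: average |predicted - true probability|.

    Requires the true probabilities, so it is only computable in simulations.
    """
    return sum(abs(t - h) for t, h in zip(p_true, p_hat)) / len(p_true)


def brier_score(y, p_hat):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((h - o) ** 2 for o, h in zip(y, p_hat)) / len(y)


def c_statistic(y, p_hat):
    """Proportion of event/non-event pairs in which the event has the higher
    predicted probability; tied predictions count as half a concordant pair."""
    events = [h for o, h in zip(y, p_hat) if o == 1]
    non_events = [h for o, h in zip(y, p_hat) if o == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in non_events
    )
    return concordant / (len(events) * len(non_events))


# Hypothetical toy data, for illustration only (not taken from the study).
p_true = [0.1, 0.4, 0.8, 0.6]   # true probabilities from an assumed DGM
p_hat = [0.2, 0.3, 0.7, 0.7]    # model predictions
y = [0, 1, 1, 0]                # observed binary outcomes

metrics = {
    "mape": mape(p_true, p_hat),
    "brier": brier_score(y, p_hat),
    "c_statistic": c_statistic(y, p_hat),
}
```

The pairwise c-statistic above is quadratic in the number of observations, which is fine for a sketch; for large validation sets a rank-based computation would be the more usual choice.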