Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.
Int J Med Inform. 2022 Jul;163:104762. doi: 10.1016/j.ijmedinf.2022.104762. Epub 2022 Apr 12.
Provide guidance on sample size considerations for developing predictive models by empirically establishing the adequate sample size, which balances the competing objectives of improving model performance and reducing model complexity as well as computational requirements.
We empirically assess the effect of sample size on prediction performance and model complexity by generating learning curves for 81 prediction problems (23 outcomes predicted in a depression cohort, 58 outcomes predicted in a hypertension cohort) in three large observational health databases, requiring training of 17,248 prediction models. The adequate sample size was defined as the sample size for which the performance of a model equalled the maximum model performance minus a small threshold value.
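The definition of the adequate sample size can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example learning-curve values, and the choice to take the smallest training size whose performance reaches the maximum minus the threshold are all assumptions made for illustration. The performance metric is left generic.

```python
from typing import Sequence


def adequate_sample_size(
    sizes: Sequence[int],
    performance: Sequence[float],
    threshold: float,
) -> int:
    """Return the smallest training size whose performance is within
    `threshold` of the best performance on the learning curve.

    Assumption: on a discrete learning curve, "equals the maximum minus a
    threshold" is read as "first reaches at least maximum - threshold".
    """
    if len(sizes) == 0 or len(sizes) != len(performance):
        raise ValueError("sizes and performance must be non-empty and equal in length")
    best = max(performance)
    # Walk the curve from the smallest to the largest training size.
    for n, score in sorted(zip(sizes, performance)):
        if score >= best - threshold:
            return n
    raise RuntimeError("unreachable: the best point always satisfies the condition")


def reduction_percent(adequate: int, full: int) -> float:
    """Percentage of observations saved by training on `adequate` instead of `full`."""
    return 100.0 * (full - adequate) / full


# Hypothetical learning curve (illustrative values, not taken from the study).
sizes = [1_000, 5_000, 10_000, 50_000, 100_000]
scores = [0.700, 0.740, 0.758, 0.763, 0.765]

for t in (0.001, 0.005, 0.01, 0.02):
    n = adequate_sample_size(sizes, scores, t)
    print(f"threshold={t}: adequate n={n}, reduction={reduction_percent(n, max(sizes)):.1f}%")
```

A larger threshold tolerates a larger loss in performance and therefore yields a smaller adequate sample size, which is the trade-off the reported reductions quantify.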
The adequate sample size achieved a median reduction in the number of observations of 9.5%, 37.3%, 58.5%, and 78.5% for thresholds of 0.001, 0.005, 0.01, and 0.02, respectively. The corresponding median reduction in the number of predictors in the models was 8.6%, 32.2%, 48.2%, and 68.3%.
Based on our results, a conservative yet significant reduction in sample size and model complexity can be estimated for future prediction work. However, a researcher who is willing to generate a learning curve may achieve a much larger reduction in model complexity, as suggested by the large outcome-dependent variability.
Our results suggest that in most cases only a fraction of the available data was sufficient to produce a model that performed close to one developed on the full data set, but with substantially reduced model complexity.