Logistic regression models for patient-level prediction based on massive observational data: Do we need all data?

Affiliations

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.

Publication information

Int J Med Inform. 2022 Jul;163:104762. doi: 10.1016/j.ijmedinf.2022.104762. Epub 2022 Apr 12.

DOI: 10.1016/j.ijmedinf.2022.104762
PMID: 35429722
Abstract

OBJECTIVE

Provide guidance on sample size considerations for developing predictive models by empirically establishing the adequate sample size, which balances the competing objectives of improving model performance and reducing model complexity as well as computational requirements.

MATERIALS AND METHODS

We empirically assess the effect of sample size on prediction performance and model complexity by generating learning curves for 81 prediction problems (23 outcomes predicted in a depression cohort, 58 outcomes predicted in a hypertension cohort) in three large observational health databases, requiring training of 17,248 prediction models. The adequate sample size was defined as the sample size for which the performance of a model equalled the maximum model performance minus a small threshold value.
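The definition of adequate sample size above can be illustrated with a minimal sketch. It assumes the learning curve is already available as paired lists of training-set sizes and performance values, and it reads "equalled the maximum minus a threshold" as "the smallest size whose performance reaches at least the maximum minus the threshold". The function names, the performance scale, and the example numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def adequate_sample_size(
    sizes: Sequence[int],
    performance: Sequence[float],
    threshold: float,
) -> int:
    """Return the smallest training size whose performance is within
    `threshold` of the best performance on the learning curve.

    Assumption: 'adequate' is interpreted as the first size that reaches
    at least (maximum performance - threshold).
    """
    if len(sizes) == 0 or len(sizes) != len(performance):
        raise ValueError("sizes and performance must be non-empty and equal in length")
    target = max(performance) - threshold
    # Walk the curve from the smallest to the largest training size.
    for n, p in sorted(zip(sizes, performance)):
        if p >= target:
            return n
    # Unreachable for threshold >= 0: the best point always meets the target.
    raise ValueError("no point on the curve meets the target; is threshold negative?")


def reduction_percent(adequate: int, full: int) -> float:
    """Percentage of observations saved relative to the full data set."""
    return 100 * (full - adequate) / full


# Hypothetical learning curve (illustrative values only).
sizes = [1_000, 5_000, 10_000, 50_000, 100_000]
performance = [0.700, 0.740, 0.758, 0.763, 0.765]

for threshold in (0.001, 0.005, 0.01, 0.02):
    n = adequate_sample_size(sizes, performance, threshold)
    print(threshold, n, reduction_percent(n, max(sizes)))
```

A larger threshold tolerates a larger loss in performance, so it can only select the same or a smaller training size. That is consistent with the reductions reported in the Results growing as the threshold increases.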

RESULTS

The adequate sample size achieves a median reduction of the number of observations of 9.5%, 37.3%, 58.5%, and 78.5% for the thresholds of 0.001, 0.005, 0.01, and 0.02, respectively. The median reduction of the number of predictors in the models was 8.6%, 32.2%, 48.2%, and 68.3% for the thresholds of 0.001, 0.005, 0.01, and 0.02, respectively.

DISCUSSION

Based on our results, a conservative yet significant reduction in sample size and model complexity can be estimated for future prediction work. However, if a researcher is willing to generate a learning curve, a much larger reduction in model complexity may be possible, as suggested by the large outcome-dependent variability.

CONCLUSION

Our results suggest that in most cases only a fraction of the available data was sufficient to produce a model close to the performance of one developed on the full data set, but with a substantially reduced model complexity.

Similar articles

1
Logistic regression models for patient-level prediction based on massive observational data: Do we need all data?
Int J Med Inform. 2022 Jul;163:104762. doi: 10.1016/j.ijmedinf.2022.104762. Epub 2022 Apr 12.
2
Sample size considerations and predictive performance of multinomial logistic prediction models.
Stat Med. 2019 Apr 30;38(9):1601-1619. doi: 10.1002/sim.8063. Epub 2019 Jan 6.
3
Sample size for binary logistic prediction models: Beyond events per variable criteria.
Stat Methods Med Res. 2019 Aug;28(8):2455-2474. doi: 10.1177/0962280218784726. Epub 2018 Jul 3.
4
Developing clinical prediction models when adhering to minimum sample size recommendations: The importance of quantifying bootstrap variability in tuning parameters and predictive performance.
Stat Methods Med Res. 2021 Dec;30(12):2545-2561. doi: 10.1177/09622802211046388. Epub 2021 Oct 8.
5
Evaluating Modeling and Validation Strategies for Tooth Loss.
J Dent Res. 2019 Sep;98(10):1088-1095. doi: 10.1177/0022034519864889. Epub 2019 Jul 30.
6
Sample size requirements for knowledge-based treatment planning.
Med Phys. 2016 Mar;43(3):1212-21. doi: 10.1118/1.4941363.
7
Adequate sample size for developing prediction models is not simply related to events per variable.
J Clin Epidemiol. 2016 Aug;76:175-82. doi: 10.1016/j.jclinepi.2016.02.031. Epub 2016 Mar 8.
8
The future of Cochrane Neonatal.
Early Hum Dev. 2020 Nov;150:105191. doi: 10.1016/j.earlhumdev.2020.105191. Epub 2020 Sep 12.
9
Effects of dataset size and interactions on the prediction performance of logistic regression and deep learning models.
Comput Methods Programs Biomed. 2022 Jan;213:106504. doi: 10.1016/j.cmpb.2021.106504. Epub 2021 Oct 28.
10
Selective thoracic fusion in AIS curves: the definition of target outcomes improves the prediction of spontaneous lumbar curve correction (SLCC).
Eur Spine J. 2014 Jun;23(6):1263-81. doi: 10.1007/s00586-014-3280-4. Epub 2014 Mar 30.

Cited by

1
How do Chinese people perceive their healthcare system? Inequality in public satisfaction with healthcare security.
Front Public Health. 2025 May 1;13:1529964. doi: 10.3389/fpubh.2025.1529964. eCollection 2025.
2
Can we develop real-world prognostic models using observational healthcare data? Large-scale experiment to investigate model sensitivity to database and phenotypes.
Diagn Progn Res. 2025 Apr 17;9(1):10. doi: 10.1186/s41512-025-00191-x.
3
Accounting for racial bias and social determinants of health in a model of hypertension control.
BMC Med Inform Decis Mak. 2025 Feb 3;25(1):53. doi: 10.1186/s12911-025-02873-4.
4
Finding a constrained number of predictor phenotypes for multiple outcome prediction.
BMJ Health Care Inform. 2025 Jan 16;32(1):e101227. doi: 10.1136/bmjhci-2024-101227.
5
Development and validation of a patient-level model to predict dementia across a network of observational databases.
BMC Med. 2024 Jul 29;22(1):308. doi: 10.1186/s12916-024-03530-9.
6
Comparing penalization methods for linear models on large observational health data.
J Am Med Inform Assoc. 2024 Jun 20;31(7):1514-1521. doi: 10.1093/jamia/ocae109.