
The performance of prognostic models depended on the choice of missing value imputation algorithm: a simulation study.

Author information

Deforth Manja, Heinze Georg, Held Ulrike

Affiliations

Department of Biostatistics at the Epidemiology, Biostatistics and Prevention Institute, University of Zurich, Zurich, Switzerland.

Center for Medical Data Science, Institute of Clinical Biometrics, Medical University of Vienna, Vienna, Austria.

Publication information

J Clin Epidemiol. 2024 Dec;176:111539. doi: 10.1016/j.jclinepi.2024.111539. Epub 2024 Sep 24.

DOI: 10.1016/j.jclinepi.2024.111539
PMID: 39326470

Abstract

OBJECTIVES

The development of clinical prediction models is often impeded by missing values in the predictors. Various methods for imputing missing values before modeling have been proposed. Some of them are variants of multiple imputation by chained equations, while others are based on single imputation. These methods may include elements of flexible modeling or machine learning algorithms, and user-friendly software packages are available for some of them. The aim of this study was to investigate by simulation whether some of these methods consistently outperform others in performance measures of clinical prediction models.

STUDY DESIGN AND SETTING

We simulated development and validation cohorts by mimicking observed distributions of predictors and outcome variable of a real data set. In the development cohorts, missing predictor values were created in 36 scenarios defined by the missingness mechanism and proportion of noncomplete cases. We applied three imputation algorithms that were available in R software (R Foundation for Statistical Computing, Vienna, Austria): mice, aregImpute, and missForest. These algorithms differed in their use of linear or flexible models, or random forests, the way of sampling from the predictive posterior distribution, and the generation of a single or multiple imputed data set. For multiple imputation, we also investigated the impact of the number of imputations. Logistic regression models were fitted with the simulated development cohorts before (full data analysis) and after missing value generation (complete case analysis), and with the imputed data. Prognostic model performance was measured by the scaled Brier score, c-statistic, calibration intercept and slope, and by the mean absolute prediction error evaluated in validation cohorts without missing values. Performance of full data analysis was considered as ideal.
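To make the performance measures named above concrete, here is a minimal Python sketch of the scaled Brier score, the c-statistic, and a mean absolute prediction error. It rests on assumptions that the abstract does not confirm: the scaled Brier score is taken as one minus the ratio of the Brier score to that of a reference model predicting the observed prevalence for everyone, and the mean absolute prediction error is taken as the average absolute difference between estimated and true outcome probabilities (available only in simulation). The function names are illustrative and do not come from the study.

```python
from typing import Sequence


def brier_score(y: Sequence[int], p: Sequence[float]) -> float:
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def scaled_brier_score(y: Sequence[int], p: Sequence[float]) -> float:
    """1 - Brier / Brier_ref (assumed definition).

    Brier_ref is the Brier score of a model that predicts the observed
    prevalence for every subject, which equals prevalence * (1 - prevalence).
    """
    prevalence = sum(y) / len(y)
    brier_ref = prevalence * (1.0 - prevalence)
    return 1.0 - brier_score(y, p) / brier_ref


def c_statistic(y: Sequence[int], p: Sequence[float]) -> float:
    """Share of event/non-event pairs in which the event has the higher
    predicted risk; tied predictions count as half a concordant pair."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(events) * len(nonevents))


def mean_absolute_prediction_error(
    p_true: Sequence[float], p_hat: Sequence[float]
) -> float:
    """Average |p_hat - p_true| (assumed definition; needs true risks,
    which are known only because the data are simulated)."""
    return sum(abs(a - b) for a, b in zip(p_true, p_hat)) / len(p_true)


if __name__ == "__main__":
    # Toy validation data, invented for illustration only.
    y_obs = [0, 0, 1, 1]
    p_hat = [0.1, 0.4, 0.35, 0.8]
    print("scaled Brier:", scaled_brier_score(y_obs, p_hat))
    print("c-statistic:", c_statistic(y_obs, p_hat))
```

The pairwise loop in `c_statistic` is quadratic in the cohort size. That is acceptable for a sketch, but a rank-based computation would be preferable for large validation cohorts.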

RESULTS

None of the imputation methods achieved the predictive accuracy that the model would attain with no missing data. In general, complete case analysis yielded the worst performance, and the deviation from ideal performance grew as the percentage of missingness increased and the sample size decreased. Across all scenarios and performance measures, aregImpute and mice, both with 100 imputations, resulted in the highest predictive accuracy. Surprisingly, aregImpute outperformed full data analysis in achieving calibration slopes very close to one across all scenarios and outcome models. Increasing the number of imputations in mice from five to 100 improved performance only marginally. The differences between the imputation methods decreased with increasing sample size and decreasing proportion of noncomplete cases.
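Since the results hinge on calibration slopes close to one, the following Python sketch shows how a calibration slope and intercept are commonly estimated. It assumes the conventional definitions, which the abstract does not spell out: the slope is the coefficient from a logistic regression of the observed outcome on the logit of the predicted risk, and the intercept is estimated with that logit entered as an offset (slope fixed at one). The Newton-Raphson fitting and all function names are illustrative, not the authors' code.

```python
import math
from typing import Sequence, Tuple


def _logit(p: float) -> float:
    # Undefined for p equal to 0 or 1; predictions must lie strictly inside (0, 1).
    return math.log(p / (1.0 - p))


def _expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def calibration_slope(
    y: Sequence[int], p: Sequence[float], n_iter: int = 100, tol: float = 1e-12
) -> Tuple[float, float]:
    """Fit logit P(y=1) = a + b * logit(p) by Newton-Raphson; return (a, b)."""
    lp = [_logit(pi) for pi in p]
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(n_iter):
        mu = [_expit(a + b * x) for x in lp]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * x for yi, mi, x in zip(y, mu, lp))
        # Information matrix X'WX with weights mu * (1 - mu).
        w = [mi * (1.0 - mi) for mi in mu]
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, lp))
        h11 = sum(wi * x * x for wi, x in zip(w, lp))
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


def calibration_intercept(
    y: Sequence[int], p: Sequence[float], n_iter: int = 100, tol: float = 1e-12
) -> float:
    """Fit logit P(y=1) = a + logit(p), i.e. slope fixed at 1; return a."""
    lp = [_logit(pi) for pi in p]
    a = 0.0
    for _ in range(n_iter):
        mu = [_expit(a + x) for x in lp]
        step = sum(yi - mi for yi, mi in zip(y, mu)) / sum(
            mi * (1.0 - mi) for mi in mu
        )
        a += step
        if abs(step) < tol:
            break
    return a


if __name__ == "__main__":
    # Toy data, invented for illustration: predictions of 0.1 and 0.9 are
    # more extreme than the observed event rates of 0.2 and 0.8.
    y_obs = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
    p_hat = [0.1] * 10 + [0.9] * 10
    print("intercept, slope:", calibration_slope(y_obs, p_hat))
```

In the toy data the predicted risks are too extreme relative to the observed event rates, so the estimated slope falls below one, the usual signature of an overfitted model. A slope near one indicates that predictions are neither too extreme nor too moderate.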

CONCLUSION

In our simulation study, the choice of imputation method affected model calibration more than model discrimination. While differences in model performance between imputation methods were generally small, multiple imputation methods such as mice and aregImpute, which can handle linear or nonlinear associations between predictors and outcome, are an attractive and reliable choice in most situations.

Similar articles

1
The performance of prognostic models depended on the choice of missing value imputation algorithm: a simulation study.
J Clin Epidemiol. 2024 Dec;176:111539. doi: 10.1016/j.jclinepi.2024.111539. Epub 2024 Sep 24.
2
Does the Presence of Missing Data Affect the Performance of the SORG Machine-learning Algorithm for Patients With Spinal Metastasis? Development of an Internet Application Algorithm.
Clin Orthop Relat Res. 2024 Jan 1;482(1):143-157. doi: 10.1097/CORR.0000000000002706. Epub 2023 Jun 12.
3
Imputation and Missing Indicators for Handling Missing Longitudinal Data: Data Simulation Analysis Based on Electronic Health Record Data.
JMIR Med Inform. 2025 Mar 13;13:e64354. doi: 10.2196/64354.
4
Missing Value Estimation using Clustering and Deep Learning within Multiple Imputation Framework.
Knowl Based Syst. 2022 Aug 5;249. doi: 10.1016/j.knosys.2022.108968. Epub 2022 May 10.
5
Are Current Survival Prediction Tools Useful When Treating Subsequent Skeletal-related Events From Bone Metastases?
Clin Orthop Relat Res. 2024 Sep 1;482(9):1710-1721. doi: 10.1097/CORR.0000000000003030. Epub 2024 Mar 22.
6
Multiple imputation with sequential penalized regression.
Stat Methods Med Res. 2019 May;28(5):1311-1327. doi: 10.1177/0962280218755574. Epub 2018 Feb 16.
7
Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation.
BMC Med Res Methodol. 2016 Oct 26;16(1):144. doi: 10.1186/s12874-016-0239-7.
8
Outcome-sensitive multiple imputation: a simulation study.
BMC Med Res Methodol. 2017 Jan 9;17(1):2. doi: 10.1186/s12874-016-0281-5.
9
Imputation and missing indicators for handling missing data in the development and deployment of clinical prediction models: A simulation study.
Stat Methods Med Res. 2023 Aug;32(8):1461-1477. doi: 10.1177/09622802231165001. Epub 2023 Apr 27.
10
Generative adversarial networks for imputing missing data for big data clinical research.
BMC Med Res Methodol. 2021 Apr 20;21(1):78. doi: 10.1186/s12874-021-01272-3.

Cited by

1
Preoperative identification of the risk factors of cervical lymph node metastasis in medullary thyroid carcinoma.
Front Endocrinol (Lausanne). 2025 Aug 21;16:1576955. doi: 10.3389/fendo.2025.1576955. eCollection 2025.
2
Imaging-pathology correlation in pancreatic cancer: Methodological considerations and future directions.
World J Gastrointest Oncol. 2025 Jul 15;17(7):103282. doi: 10.4251/wjgo.v17.i7.103282.
3
The role of lipid profile in the relationship between skipping breakfast and hyperuricemia: a moderated mediation model.
BMC Public Health. 2025 Apr 10;25(1):1347. doi: 10.1186/s12889-025-22594-7.