Selecting the model for multiple imputation of missing data: Just use an IC!

Author affiliations

Discipline of Biomedical Informatics and Digital Health, The University of Sydney, Sydney, New South Wales, Australia.

School of Mathematics and Statistics, The University of New South Wales, Sydney, New South Wales, Australia.

Publication information

Stat Med. 2021 May 10;40(10):2467-2497. doi: 10.1002/sim.8915. Epub 2021 Feb 24.

DOI: 10.1002/sim.8915

PMID: 33629367

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8248419/

Abstract

Multiple imputation and maximum likelihood estimation (via the expectation-maximization algorithm) are two well-known methods readily used for analyzing data with missing values. While these two methods are often considered as being distinct from one another, multiple imputation (when using improper imputation) is actually equivalent to a stochastic expectation-maximization approximation to the likelihood. In this article, we exploit this key result to show that familiar likelihood-based approaches to model selection, such as Akaike's information criterion (AIC) and the Bayesian information criterion (BIC), can be used to choose the imputation model that best fits the observed data. Poor choice of imputation model is known to bias inference, and while sensitivity analysis has often been used to explore the implications of different imputation models, we show that the data can be used to choose an appropriate imputation model via conventional model selection tools. We show that BIC can be consistent for selecting the correct imputation model in the presence of missing data. We verify these results empirically through simulation studies, and demonstrate their practicality on two classical missing data examples. An interesting result we saw in simulations was that not only can parameter estimates be biased by misspecifying the imputation model, but also by overfitting the imputation model. This emphasizes the importance of using model selection not just to choose the appropriate type of imputation model, but also to decide on the appropriate level of imputation model complexity.
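As a rough illustration of the idea, and not the authors' procedure, the sketch below compares three candidate Gaussian imputation models for a partially observed variable `y` given a fully observed `x`, using AIC = -2 log L + 2k and BIC = -2 log L + k log n. It assumes `y` is missing at random given `x`, in which case the observed-data likelihood for the conditional model of `y` given `x` reduces to the likelihood over rows where `y` is observed, so no stochastic EM approximation is needed. The simulated data, the candidate set, and the function names `gaussian_ic` and `compare_imputation_models` are all hypothetical choices made for this example.

```python
import numpy as np

def gaussian_ic(y, X):
    """Fit y ~ X by least squares and return (AIC, BIC) under a Gaussian model."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                 # ML estimate of the residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = p + 1                                  # regression coefficients plus the variance
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

def compare_imputation_models(x, y):
    """Score candidate models for y given x on the rows where y is observed."""
    obs = ~np.isnan(y)
    xo, yo = x[obs], y[obs]
    ones = np.ones_like(xo)
    designs = {
        "intercept": np.column_stack([ones]),            # ignores x (too simple)
        "linear": np.column_stack([ones, xo]),           # matches the simulated truth
        "quadratic": np.column_stack([ones, xo, xo**2]), # one unnecessary term
    }
    return {name: gaussian_ic(yo, X) for name, X in designs.items()}

# Simulated example (assumption): y depends linearly on x, and the chance
# that y is missing depends only on the fully observed x.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
missing = rng.random(n) < 1.0 / (1.0 + np.exp(-x))
y[missing] = np.nan

scores = compare_imputation_models(x, y)
best_bic = min(scores, key=lambda name: scores[name][1])
for name, (aic, bic) in scores.items():
    print(f"{name:10s} AIC={aic:9.2f} BIC={bic:9.2f}")
print("lowest BIC:", best_bic)
```

The candidate with the lowest criterion value would then be used to generate the imputations. When covariates are themselves incomplete, this closed-form shortcut is no longer available, which is the setting where an approximation to the observed-data likelihood becomes necessary.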

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00f0/8248419/773b68c5451e/SIM-40-2467-g003.jpg
