Discipline of Biomedical Informatics and Digital Health, The University of Sydney, Sydney, New South Wales, Australia.
School of Mathematics and Statistics, The University of New South Wales, Sydney, New South Wales, Australia.
Stat Med. 2021 May 10;40(10):2467-2497. doi: 10.1002/sim.8915. Epub 2021 Feb 24.
Multiple imputation and maximum likelihood estimation (via the expectation-maximization algorithm) are two well-known methods commonly used for analyzing data with missing values. While these two methods are often considered distinct from one another, multiple imputation (when using improper imputation) is actually equivalent to a stochastic expectation-maximization approximation to the likelihood. In this article, we exploit this key result to show that familiar likelihood-based approaches to model selection, such as Akaike's information criterion (AIC) and the Bayesian information criterion (BIC), can be used to choose the imputation model that best fits the observed data. A poor choice of imputation model is known to bias inference, and while sensitivity analysis has often been used to explore the implications of different imputation models, we show that the data can be used to choose an appropriate imputation model via conventional model selection tools. We show that BIC can be consistent for selecting the correct imputation model in the presence of missing data. We verify these results empirically through simulation studies and demonstrate their practicality on two classical missing data examples. An interesting result from the simulations was that parameter estimates can be biased not only by misspecifying the imputation model but also by overfitting it. This emphasizes the importance of using model selection not just to choose the appropriate type of imputation model, but also to decide on the appropriate level of imputation model complexity.
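To make the idea of scoring candidate imputation models with AIC and BIC concrete, the following is a minimal sketch and not the article's procedure: it does not reproduce the stochastic expectation-maximization approximation. It assumes a deliberately simple special case in which a covariate `x` is fully observed, only `y` has missing values, and the missingness is ignorable, so the observed-data likelihood for the model of `y` given `x` reduces to the likelihood over the cases where `y` is observed. The simulated data, the two candidate models, the helper names, and the use of the number of observed `y` values as the sample size in BIC are all illustrative assumptions.

```python
import math
import random


def gaussian_loglik(residuals):
    """Maximized Gaussian log-likelihood, using the MLE variance RSS / n."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(loglik, k):
    """Akaike's information criterion for k free parameters."""
    return -2 * loglik + 2 * k


def bic(loglik, k, n):
    """Bayesian information criterion for k free parameters and n cases."""
    return -2 * loglik + k * math.log(n)


def fit_mean_only(x, y):
    """Candidate 1: y ~ Normal(mu, sigma^2). Returns residuals and k."""
    mu = sum(y) / len(y)
    return [yi - mu for yi in y], 2


def fit_linear(x, y):
    """Candidate 2: y ~ Normal(a + b*x, sigma^2). Returns residuals and k."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return [yi - (a + b * xi) for xi, yi in zip(x, y)], 3


# Simulated data (assumption): y depends linearly on x, and roughly 30% of
# the y values are deleted completely at random.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]
y_with_missing = [yi if rng.random() > 0.3 else None for yi in y]

# Keep only the cases where y is observed; under the stated assumptions
# these carry all the likelihood information about the model of y given x.
observed = [(xi, yi) for xi, yi in zip(x, y_with_missing) if yi is not None]
x_obs = [p[0] for p in observed]
y_obs = [p[1] for p in observed]

candidates = {"mean-only": fit_mean_only, "linear": fit_linear}
scores = {}
for name, fit in candidates.items():
    residuals, k = fit(x_obs, y_obs)
    ll = gaussian_loglik(residuals)
    scores[name] = {"aic": aic(ll, k), "bic": bic(ll, k, len(y_obs))}

# Lower is better for both criteria.
best_by_bic = min(scores, key=lambda m: scores[m]["bic"])
best_by_aic = min(scores, key=lambda m: scores[m]["aic"])
```

The selected candidate would then serve as the imputation model for `y`. In settings with several incomplete variables the observed-data likelihood no longer simplifies this way, which is where an approximation such as the one described in the abstract becomes relevant.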