Lee Katherine J, Carlin John B
Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, The Royal Children's Hospital, Flemington Road, Parkville, VIC, 3052, Australia.
Emerg Themes Epidemiol. 2012 Jun 13;9(1):3. doi: 10.1186/1742-7622-9-3.
Multiple imputation is becoming increasingly popular for handling missing data. However, it is often implemented without adequate consideration of whether it offers any advantage over complete case analysis for the research question of interest, or whether potential gains may be offset by bias from a poorly fitting imputation model, particularly as the amount of missing data increases.
Simulated datasets (n = 1000) drawn from a synthetic population were used to explore information recovery from multiple imputation in estimating the coefficient of a binary exposure variable when various proportions of data (10-90%) were set missing at random in a highly skewed continuous covariate or in the binary exposure. Imputation was performed using multivariate normal imputation (MVNI), with a simple or zero-skewness log transformation to manage non-normality. Bias, precision, mean-squared error and coverage for a set of regression parameter estimates were compared between multiple imputation and complete case analyses.
For missingness in the continuous covariate, multiple imputation produced less bias and greater precision for the effect of the binary exposure variable, compared with complete case analysis, with larger gains in precision with more missing data. However, even with only moderate missingness, large bias and substantial under-coverage were apparent in estimating the continuous covariate's effect when skewness was not adequately addressed. For missingness in the binary exposure, all estimates had negligible bias but gains in precision from multiple imputation were minimal, particularly for the coefficient of the binary exposure itself.
Although multiple imputation can be useful if covariates required for confounding adjustment are missing, benefits are likely to be minimal when data are missing in the exposure variable of interest. Furthermore, when there are large amounts of missingness, multiple imputation can become unreliable and introduce bias not present in a complete case analysis if the imputation model is not appropriate. Epidemiologists dealing with missing data should keep in mind the potential limitations as well as the potential benefits of multiple imputation. Further work is needed to provide clearer guidelines on effective application of this method.
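The comparison described in the methods can be illustrated with a minimal Python sketch. It is not the authors' code and does not reproduce their synthetic population: the effect sizes, the missingness probabilities and all function names (`simulate`, `impute_once`, `rubin_pool`, `compare`) are assumptions chosen for illustration. Because only one variable is incomplete here, a single Bayesian linear regression on the log scale stands in for full MVNI, and a simple log transformation is used rather than a zero-skewness one. Imputed datasets are combined with Rubin's rules.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, their covariance matrix, and the residual sum of squares."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    rss = float(resid @ resid)
    return beta, rss / (len(y) - X.shape[1]) * xtx_inv, rss

def simulate(rng, n=1000):
    """Binary exposure, log-normal (skewed) covariate, continuous outcome.
    All numeric values below are illustrative assumptions."""
    exposure = rng.binomial(1, 0.5, n).astype(float)
    covariate = np.exp(rng.normal(0.3 * exposure, 1.0))
    outcome = 1.0 + 0.5 * exposure + 0.3 * covariate + rng.normal(0.0, 1.0, n)
    # Missing at random: missingness depends only on the fully observed exposure.
    missing = rng.random(n) < (0.3 + 0.4 * exposure)
    covariate = np.where(missing, np.nan, covariate)
    return exposure, covariate, outcome, missing

def impute_once(rng, exposure, covariate, outcome, missing):
    """Draw one completed covariate vector from a log-scale regression model."""
    obs = ~missing
    Z = np.column_stack([np.ones(len(outcome)), exposure, outcome])
    beta_hat, _, rss = ols(Z[obs], np.log(covariate[obs]))
    # Draw the residual variance, then the coefficients, from their posteriors
    # so that parameter uncertainty is carried into the imputations.
    sigma2 = rss / rng.chisquare(obs.sum() - Z.shape[1])
    chol = np.linalg.cholesky(sigma2 * np.linalg.inv(Z[obs].T @ Z[obs]))
    beta = beta_hat + chol @ rng.standard_normal(Z.shape[1])
    log_draw = Z[missing] @ beta + rng.normal(0.0, np.sqrt(sigma2), missing.sum())
    completed = covariate.copy()
    completed[missing] = np.exp(log_draw)  # back-transform to the original scale
    return completed

def rubin_pool(estimates, variances):
    """Rubin's rules: pooled estimate and total variance across m imputations."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    within = variances.mean(axis=0)
    between = estimates.var(axis=0, ddof=1)
    return estimates.mean(axis=0), within + (1 + 1 / m) * between

def compare(seed=2012, n=1000, m=20):
    """Exposure coefficient from complete case analysis and from multiple imputation."""
    rng = np.random.default_rng(seed)
    exposure, covariate, outcome, missing = simulate(rng, n)
    obs = ~missing

    def design(c):
        return np.column_stack([np.ones(len(c)), exposure[: len(c)], c])

    # Complete case analysis: drop every row with a missing covariate.
    X_cc = np.column_stack([np.ones(obs.sum()), exposure[obs], covariate[obs]])
    cc_beta, cc_cov, _ = ols(X_cc, outcome[obs])

    estimates, variances = [], []
    for _ in range(m):
        completed = impute_once(rng, exposure, covariate, outcome, missing)
        beta, cov, _ = ols(design(completed), outcome)
        estimates.append(beta)
        variances.append(np.diag(cov))
    mi_beta, mi_var = rubin_pool(estimates, variances)

    return {
        "n_complete": int(obs.sum()),
        "cc_estimate": float(cc_beta[1]),
        "cc_se": float(np.sqrt(cc_cov[1, 1])),
        "mi_estimate": float(mi_beta[1]),
        "mi_se": float(np.sqrt(mi_var[1])),
    }

if __name__ == "__main__":
    print(compare())
```

Repeating `compare` over many seeds and over a range of missingness proportions would give empirical bias, standard error, mean-squared error and coverage for each approach, which is the kind of comparison the study reports. In this sketch the outcome depends on the untransformed covariate, so the log-scale imputation model is only an approximation, and its adequacy should be checked rather than assumed.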