Grobler Anneke C, Lee Katherine
Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, Australia.
Department of Paediatrics, The University of Melbourne, Parkville, Victoria, Australia.
Biom J. 2020 Mar;62(2):467-478. doi: 10.1002/bimj.201900011. Epub 2019 Jul 15.
Multiple imputation (MI) is used to handle missing at random (MAR) data. Despite warnings from statisticians, continuous variables are often recoded into binary variables. With MI, it is important that the imputation and analysis models are compatible; variables should be imputed in the same form in which they appear in the analysis model. With an encoded binary variable, more accurate imputations may be obtained by imputing the underlying continuous variable. We conducted a simulation study to explore how best to impute a binary variable that was created from an underlying continuous variable. We generated a completely observed continuous outcome associated with an incomplete binary covariate that is a categorized version of an underlying continuous covariate, and an auxiliary variable associated with the underlying continuous covariate. We simulated data with several sample sizes, and set 25% or 50% of the covariate values to be MAR, dependent on the outcome and the auxiliary variable. We compared the performance of five imputation methods: (a) imputation of the binary variable using logistic regression; (b) imputation of the continuous variable using linear regression, followed by categorization into the binary variable; (c, d) imputation of both the continuous and binary variables using fully conditional specification (FCS) and multivariate normal imputation, respectively; (e) substantive-model compatible (SMC) FCS. Bias and standard errors were large when only the continuous variable was imputed. The other methods performed adequately. Imputation of both the binary and continuous variables using FCS often encountered mathematical difficulties. We recommend the SMC-FCS method as it performed best in our simulation studies.
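The simulation design described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' code: the function name `simulate`, the cut-point of 0, the effect size `beta`, the correlation between the covariate and the auxiliary variable, and the coefficients of the missingness model are all assumed values, since the abstract does not report them.

```python
import numpy as np

def simulate(n=1000, miss_prop=0.25, beta=1.0, seed=0):
    """Generate one data set resembling the design in the abstract (assumed values)."""
    rng = np.random.default_rng(seed)

    # Underlying continuous covariate and an auxiliary variable associated with it
    # (an association of 0.5 is an assumption made for illustration).
    x_cont = rng.normal(size=n)
    z = 0.5 * x_cont + rng.normal(scale=np.sqrt(0.75), size=n)

    # Binary covariate: a categorized version of the continuous covariate
    # (the cut-point 0 is hypothetical).
    x_bin = (x_cont > 0).astype(float)

    # Completely observed continuous outcome associated with the binary covariate.
    y = beta * x_bin + rng.normal(size=n)

    # MAR mechanism: the probability of missingness depends only on the
    # fully observed outcome and auxiliary variable (coefficients assumed).
    lin = 0.5 * y + 0.5 * z

    # Bisection on the intercept so that the average missingness
    # probability matches the target proportion.
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if np.mean(1 / (1 + np.exp(-(mid + lin)))) > miss_prop:
            hi = mid
        else:
            lo = mid
    intercept = (lo + hi) / 2
    p_miss = 1 / (1 + np.exp(-(intercept + lin)))
    missing = rng.random(n) < p_miss

    # The continuous and binary versions of the covariate are missing together.
    return {
        "y": y,
        "z": z,
        "x_cont_obs": np.where(missing, np.nan, x_cont),
        "x_bin_obs": np.where(missing, np.nan, x_bin),
        "missing": missing,
    }

data = simulate(n=1000, miss_prop=0.25)
print(f"proportion missing: {data['missing'].mean():.3f}")
```

Each imputation method compared in the study would then be applied to data sets like this one, with `miss_prop` set to 0.25 or 0.50 and `n` varied across the sample sizes of interest.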