Department of Epidemiology and Biostatistics, School of Public Health, University at Albany, SUNY, One University Place, Rensselaer, NY 12144-3456, USA.
Stat Med. 2011 Dec 20;30(29):3447-60. doi: 10.1002/sim.4355. Epub 2011 Oct 4.
The multivariate normal (MVN) distribution is arguably the most popular parametric model used in imputation and is available in most software packages (e.g., SAS PROC MI, R package norm). When it is applied to categorical variables as an approximation, practitioners often either apply simple rounding techniques to ordinal variables or create a distinct 'missing' category and/or exclude the nominal variable from the imputation phase. All of these practices can lead to biased and/or uninterpretable inferences. In this work, we develop a new rounding methodology, calibrated to preserve observed distributions, to multiply impute missing categorical covariates. The main attraction of this method is its flexibility: it can be used with any 'working' imputation software, particularly software based on the MVN, allowing practitioners to obtain usable imputations with small biases. A simulation study demonstrates the clear advantage of the proposed method in rounding ordinal variables and, in some scenarios, its plausibility in imputing nominal variables. We illustrate our methods on the widely used National Survey of Children with Special Health Care Needs, where incomplete values on race posed a valid threat to inferences pertaining to disparities.
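The idea of rounding with cutpoints calibrated to the observed distribution can be illustrated with a minimal Python sketch. This is not the authors' procedure, which the abstract does not spell out. It assumes that continuous imputations are also available for cases whose category is known (here the hypothetical array `z_cal`, simulated rather than produced by real imputation software), and it picks cutpoints as quantiles of those values so that the rounded values reproduce the observed category proportions. The function names `calibrated_cutpoints` and `round_with_cutpoints` are illustrative only.

```python
import numpy as np


def calibrated_cutpoints(y_obs, z_cal, n_categories):
    """Return cutpoints so that rounding z_cal reproduces the distribution of y_obs.

    y_obs: observed ordinal codes 0..n_categories-1.
    z_cal: continuous values from a working normal model for those same cases
           (an assumption of this sketch).
    """
    counts = np.bincount(np.asarray(y_obs), minlength=n_categories)
    # Cumulative observed proportions, excluding the final value of 1.0.
    cum_props = np.cumsum(counts)[:-1] / counts.sum()
    return np.quantile(z_cal, cum_props)


def round_with_cutpoints(z, cutpoints):
    """Map continuous values to ordinal codes: z <= c0 -> 0, c0 < z <= c1 -> 1, ..."""
    return np.searchsorted(cutpoints, z, side="left")


rng = np.random.default_rng(0)
n_categories = 3

# Simulated stand-ins: observed categories and noisy continuous "imputations".
y_obs = rng.choice(n_categories, size=5000, p=[0.6, 0.3, 0.1])
z_cal = y_obs + rng.normal(0.0, 0.7, size=y_obs.size)

observed_props = np.bincount(y_obs, minlength=n_categories) / y_obs.size

# Simple rounding to the nearest category code, clipped to the valid range.
naive = np.clip(np.rint(z_cal), 0, n_categories - 1).astype(int)
naive_props = np.bincount(naive, minlength=n_categories) / naive.size

# Calibrated rounding using cutpoints tuned to the observed distribution.
cuts = calibrated_cutpoints(y_obs, z_cal, n_categories)
calibrated = round_with_cutpoints(z_cal, cuts)
calibrated_props = np.bincount(calibrated, minlength=n_categories) / calibrated.size

print("observed  :", observed_props)
print("naive     :", naive_props)
print("calibrated:", calibrated_props)

# The same cutpoints are then applied to continuous imputations of missing cases.
z_mis = rng.normal(0.5, 1.0, size=1000)  # hypothetical imputed values
y_mis = round_with_cutpoints(z_mis, cuts)
```

In this sketch the calibrated cutpoints, unlike nearest-integer rounding, are chosen from the data, so the category proportions on the calibration cases match the observed ones by construction.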