Reilly M, Pepe M
Department of Statistics, University College Dublin, Belfield, Ireland.
Stat Med. 1997;16(1-3):5-19. doi: 10.1002/(sici)1097-0258(19970115)16:1<5::aid-sim469>3.0.co;2-8.
Hot-deck imputation is an intuitively simple and popular method of accommodating incomplete data. Users of the method often apply the usual multiple imputation variance estimator, which is not appropriate in this case. However, no variance expression has yet been derived for this easily implemented method when it is applied to missing covariates in regression models. The simple hot-deck method is in fact asymptotically equivalent to the mean-score method for the estimation of a regression model parameter, so that hot-deck can be understood in the context of likelihood methods. Both methods accommodate data where missingness may depend on the observed variables but not on the unobserved value of the incomplete covariate, that is, data that are missing at random (MAR). The asymptotic properties of hot-deck are derived here for the case where the fully observed variables are categorical, though the incomplete covariate(s) may be continuous. Simulation studies indicate that the two methods compare well in small samples and for small numbers of imputations. Current users of hot-deck may now conduct their analysis using mean-score, which is a weighted likelihood method and can thus be implemented by a single pass through the data using any standard package that accommodates weighted regression models. Valid inference is now straightforward using the variance expression provided here. The equivalence of mean-score and hot-deck is illustrated using three clinical data sets in which an important covariate is missing for a large number of study subjects.
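The two approaches compared in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation. It assumes that strata are formed by the fully observed categorical variables (here the hypothetical fields `y` and `z`), that hot-deck draws each missing value of the incomplete covariate (hypothetical field `x`) at random from complete cases in the same stratum, and that the mean-score weight for a complete case is the stratum size divided by the number of complete cases in that stratum. The function names and the toy records are invented for the example.

```python
import random
from collections import defaultdict


def stratum_key(rec, strata_keys):
    """Identify a record's stratum by its fully observed categorical values."""
    return tuple(rec[k] for k in strata_keys)


def mean_score_weights(records, strata_keys, target):
    """Assumed weighting: n(stratum) / n_complete(stratum) for complete
    cases, and 0.0 for records whose target covariate is missing."""
    total = defaultdict(int)
    complete = defaultdict(int)
    for rec in records:
        key = stratum_key(rec, strata_keys)
        total[key] += 1
        if rec[target] is not None:
            complete[key] += 1
    weights = []
    for rec in records:
        key = stratum_key(rec, strata_keys)
        if rec[target] is None:
            weights.append(0.0)
        else:
            weights.append(total[key] / complete[key])
    return weights


def hot_deck_impute(records, strata_keys, target, rng):
    """Return one completed copy of the data: each missing target value is
    replaced by a random draw from complete cases in the same stratum."""
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[stratum_key(rec, strata_keys)].append(rec[target])
    completed = []
    for rec in records:
        new = dict(rec)
        if new[target] is None:
            pool = donors[stratum_key(rec, strata_keys)]
            if not pool:
                raise ValueError("stratum has no complete cases to donate")
            new[target] = rng.choice(pool)
        completed.append(new)
    return completed


# Toy data (hypothetical): y is the outcome, z a fully observed categorical
# covariate, x the incomplete covariate (None marks a missing value).
records = [
    {"y": 0, "z": "a", "x": 1.2},
    {"y": 0, "z": "a", "x": None},
    {"y": 0, "z": "a", "x": 0.7},
    {"y": 1, "z": "a", "x": 2.5},
    {"y": 1, "z": "a", "x": None},
    {"y": 1, "z": "a", "x": None},
]

weights = mean_score_weights(records, ("y", "z"), "x")
completed = hot_deck_impute(records, ("y", "z"), "x", random.Random(0))
```

Under these assumptions, the weights could be passed to any weighted regression routine fitted to the complete cases only, whereas hot-deck would repeat `hot_deck_impute` several times and fit the model to each completed data set. The sketch covers point estimation inputs only; it does not implement the variance expression derived in the paper.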