Siddique Juned, Belin Thomas R
Department of Health Studies, University of Chicago, Chicago, IL 60637, U.S.A.
Stat Med. 2008 Jan 15;27(1):83-102. doi: 10.1002/sim.3001.
Hot-deck imputation offers advantages in reflecting salient features of data distributions in missing-data problems, but previous implementations have lacked the appeal associated with modern Bayesian statistical-computing techniques. We outline a strategy of iterative hot-deck multiple imputation with distance-based donor selection. With distance defined as a monotonic function of the difference in predictive means between cases, donors are chosen with probability inversely proportional to their distance from the donee. This method retains the implementation ease of ad hoc techniques, while incorporating the desirable features of Bayesian approaches. Special cases of our method include nearest-neighbor imputation and a simple random hot-deck. Iterating the procedure provides an analogy to Markov Chain Monte Carlo methods and is intended to mitigate dependence on starting values. Results from imputing missing values in a longitudinal depression treatment trial as well as a simulation study are presented. We evaluate how different definitions of distance, choices of starting values, the order in which variables are chosen for imputation, and the number of iterations impact inferences. We show that our measure of distance controls the tradeoff between bias and variance of our estimates. We find that inferences from the depression treatment trial are not sensitive to most definitions of distance. In addition, while differences exist between 1 iteration and 10 iterations, there are no meaningful differences between inferences based on 10 iterations and those based on 500 iterations. The choice of starting value did not have an impact on inferences but the order in which the variables were chosen for imputation was significant even after iteration.
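The donor-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: predictive means come from an ordinary least-squares fit on the observed cases (rather than a Bayesian draw), distance is the absolute difference in predictive means raised to a power `k` (one possible monotonic function), and a small offset `eps` avoids division by zero. The names `donor_probabilities`, `impute_once`, `k`, and `eps` are hypothetical. Under these assumptions, `k = 0` gives a simple random hot-deck and a large `k` approaches nearest-neighbor imputation.

```python
import numpy as np


def donor_probabilities(pred_donors, pred_donee, k=1.0, eps=1e-8):
    """Selection probabilities inversely proportional to distance.

    Distance is assumed to be (|difference in predictive means| + eps) ** k.
    Weights are computed in log space so that a large k does not overflow.
    """
    diff = np.abs(np.asarray(pred_donors, dtype=float) - pred_donee)
    log_w = -k * np.log(diff + eps)   # log of 1 / distance
    log_w -= log_w.max()              # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()


def impute_once(X, y, k=1.0, rng=None):
    """Fill missing entries of y (NaN) with values drawn from observed donors.

    Assumption: predictive means are taken from a least-squares regression
    of y on X fitted to the observed cases only.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)

    # Design matrix with an intercept column.
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design[~missing], y[~missing], rcond=None)
    pred = design @ beta              # predictive mean for every case

    y_imp = y.copy()
    for i in np.flatnonzero(missing):
        p = donor_probabilities(pred[~missing], pred[i], k)
        # Copy the observed value of one randomly selected donor.
        y_imp[i] = rng.choice(y[~missing], p=p)
    return y_imp
```

Calling `impute_once` several times with different random draws would give multiple completed data sets. The iterative scheme in the abstract, which cycles over several incomplete variables and repeats, is not shown here.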