Department of Biostatistics, University of Michigan, USA.
Survey Methodology Program, Institute for Social Research, USA.
Stat Methods Med Res. 2021 Dec;30(12):2685-2700. doi: 10.1177/09622802211047346. Epub 2021 Oct 13.
Multiple imputation is a well-established general technique for analyzing data with missing values. A convenient way to implement multiple imputation is sequential regression multiple imputation, also called chained equations multiple imputation. In this approach, we impute missing values using a regression model for each variable, conditional on the other variables in the data. This approach, however, assumes that the data are missing at random, and it is not well justified under not-at-random missingness without additional modification. In this paper, we describe how to generalize the sequential regression multiple imputation procedure to handle missingness not at random in the setting where, conditional on the fully observed variables, missingness may depend on other variables that are also missing but not on the missing variable itself. We provide algebraic justification for several generalizations of standard sequential regression multiple imputation using Taylor series and other approximations of the target imputation distribution under missingness not at random. The resulting regression model approximations include indicators for missingness, interactions, or other functions of the not-at-random missingness model and the observed data. In a simulation study, we demonstrate that the proposed modifications reduce bias in the final analysis compared to standard sequential regression multiple imputation, with an approximation strategy that includes an offset in the imputation model performing best overall. The method is illustrated in a breast cancer study, where the goal is to estimate the prevalence of a specific genetic pathogenic variant.
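As a rough illustration of the idea described above, the following Python sketch runs a chained-equations loop in which each imputation regression also includes the missingness indicator of the other partially observed variable. It is not the authors' implementation: the helper names (`solve`, `ols`, `simulate`, `impute_chained`), the toy data-generating model, and the linear-normal imputation models are all assumptions made for illustration. For brevity, the regression coefficients are not drawn from their posterior distribution, and the offset-based strategy mentioned in the abstract is not shown.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Least-squares coefficients and residual SD via the normal equations."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))
    return beta, (rss / max(len(y) - p, 1)) ** 0.5

def simulate(n=200, seed=1):
    """Hypothetical toy data: z is complete; x1 and x2 can each be missing,
    with the chance of missingness depending on the *other* variable."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        x1 = z + rng.gauss(0, 1)
        x2 = 0.5 * x1 + rng.gauss(0, 1)
        drop1 = rng.random() < (0.5 if x2 > 0 else 0.1)
        drop2 = rng.random() < (0.5 if x1 > 0 else 0.1)
        rows.append({"z": z, "x1": None if drop1 else x1, "x2": None if drop2 else x2})
    return rows

def impute_chained(data, n_iter=10, seed=0):
    """Return one completed copy of data (missing entries are None on input)."""
    rng = random.Random(seed)
    names = ("x1", "x2")
    miss = {v: [row[v] is None for row in data] for v in names}
    filled = [dict(row) for row in data]
    # Start from observed means, then refine by cycling through the variables.
    for v in names:
        obs = [row[v] for row in data if row[v] is not None]
        mean = sum(obs) / len(obs)
        for row in filled:
            if row[v] is None:
                row[v] = mean
    for _ in range(n_iter):
        for v, other in (("x1", "x2"), ("x2", "x1")):
            def design(i):
                # Intercept, complete covariate, current value of the other
                # variable, and the other variable's missingness indicator.
                return [1.0, filled[i]["z"], filled[i][other], float(miss[other][i])]
            obs_idx = [i for i in range(len(data)) if not miss[v][i]]
            beta, sd = ols([design(i) for i in obs_idx], [filled[i][v] for i in obs_idx])
            for i in range(len(data)):
                if miss[v][i]:
                    mu = sum(b * x for b, x in zip(beta, design(i)))
                    filled[i][v] = mu + rng.gauss(0, sd)  # add residual noise
    return filled

data = simulate()
completed_sets = [impute_chained(data, seed=m) for m in range(5)]  # 5 imputations
```

Dropping the fourth column of `design` recovers a standard chained-equations step, which makes it easy to compare the two variants on the same simulated data.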