Si Yajuan, Heeringa Steve, Johnson David, Little Roderick J A, Liu Wenshuo, Pfeffer Fabian, Raghunathan Trivellore
Research Assistant Professor, Survey Research Center, Institute for Social Research, University of Michigan, 426 Thompson St., Ann Arbor, MI 48104, USA.
Senior Research Scientist, Survey Research Center, Institute for Social Research, University of Michigan, 426 Thompson St., Ann Arbor, MI 48104, USA.
J Surv Stat Methodol. 2021 Oct 19;11(1):260-283. doi: 10.1093/jssam/smab038. eCollection 2023 Feb.
Multiple imputation (MI) is a popular and well-established method for handling missing data in multivariate data sets, but its practicality for massive and complex data sets has been questioned. One such data set is the Panel Study of Income Dynamics (PSID), a longstanding and extensive survey of household income and wealth in the United States. Missing data in this survey are currently handled with traditional hot deck methods because they are simple to implement; however, the univariate hot deck results in large random wealth fluctuations. MI is effective but faces operational challenges. We use a sequential regression/chained-equation approach, implemented in the software IVEware, to multiply impute cross-sectional wealth data in the 2013 PSID, and compare analyses of the resulting imputed data with those from the current hot deck approach. Practical difficulties, such as non-normally distributed variables, skip patterns, categorical variables with many levels, and multicollinearity, are described together with our approaches to overcoming them. We evaluate imputation quality and validity with internal diagnostics and external benchmarking data. MI improves on the existing hot deck approach by helping preserve correlation structures, such as the associations between PSID wealth components and the relationships between household net worth and sociodemographic factors, and it facilitates general-purpose completed-data analyses. MI incorporates highly predictive covariates into imputation models and increases efficiency. We recommend the practical implementation of MI and expect greater gains when the fraction of missing information is large.
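The abstract does not say how estimates from the multiply imputed data sets are combined. As an illustration only, the sketch below assumes the standard Rubin combining rules, which average the completed-data estimates and add the within-imputation and between-imputation variances. The function name `pool_rubin`, the large-sample approximation used for the fraction of missing information, and the input numbers are all hypothetical; none of this is taken from the paper or from IVEware.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M completed-data estimates and their variances (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (M - 1 denominator)
    t = u_bar + (1 + 1 / m) * b    # total variance of the pooled estimate

    # Large-sample approximation: share of total variance due to missing data.
    fmi_approx = (1 + 1 / m) * b / t if t > 0 else 0.0

    return {
        "estimate": q_bar,
        "within": u_bar,
        "between": b,
        "total_variance": t,
        "fmi_approx": fmi_approx,
    }


# Hypothetical estimates and variances from M = 5 imputed data sets.
result = pool_rubin([10.0, 12.0, 11.0, 13.0, 9.0], [4.0, 4.0, 4.0, 4.0, 4.0])
print(result)
```

In this made-up example the between-imputation component contributes a large share of the total variance, which is the situation in which the abstract expects MI to yield greater gains.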