Musil Carol M, Warner Camille B, Yobas Piyanee Klainin, Jones Susan L
Frances Payne Bolton School of Nursing, Case Western Reserve University, Cleveland, Ohio, USA.
West J Nurs Res. 2002 Nov;24(7):815-29. doi: 10.1177/019394502762477004.
Researchers are commonly faced with the problem of missing data. This article presents theoretical and empirical information for the selection and application of approaches for handling missing data on a single variable. An actual data set of 492 cases with no missing values was used to create a simulated yet realistic data set with missing at random (MAR) data. The authors compare five approaches for dealing with missing data (listwise deletion, mean substitution, simple regression, regression with an error term, and the expectation maximization [EM] algorithm), and compare the effects of each method on descriptive statistics and correlation coefficients for the imputed data (n = 96) and for the entire sample (n = 492) when imputed data are included. All methods had limitations, although our findings suggest that mean substitution was the least effective and that regression with an error term and the EM algorithm produced estimates closest to those of the original variables.
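As a rough illustration of three of the approaches named in the abstract (mean substitution, simple regression, and regression with an error term), the following Python sketch imputes a single variable on simulated data. It is not the authors' data or procedure: the variables `x` and `y`, the slope of 0.5, the missingness probabilities, and the helper names `fit_line` and `impute` are all assumptions made for this example. Listwise deletion corresponds to simply using the observed cases, and the EM algorithm is omitted for brevity.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual SD)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    # Residual SD with n - 2 degrees of freedom
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (sse / (len(xs) - 2)) ** 0.5


def impute(xs, ys, method, rng):
    """Return a copy of ys with None entries filled by the chosen method."""
    obs_x = [x for x, y in zip(xs, ys) if y is not None]
    obs_y = [y for y in ys if y is not None]
    if method == "mean":
        fill = statistics.fmean(obs_y)
        return [fill if y is None else y for y in ys]
    a, b, resid_sd = fit_line(obs_x, obs_y)
    if method == "regression":
        # Predicted value only: imputed points fall exactly on the fitted line
        return [a + b * x if y is None else y for x, y in zip(xs, ys)]
    if method == "stochastic":
        # Predicted value plus a random draw from the residual distribution
        return [a + b * x + rng.gauss(0, resid_sd) if y is None else y
                for x, y in zip(xs, ys)]
    raise ValueError(f"unknown method: {method}")


# Simulated complete data (assumed model, not the study's data set)
rng = random.Random(2002)
xs = [rng.gauss(0, 1) for _ in range(492)]
ys_complete = [0.5 * x + rng.gauss(0, 1) for x in xs]

# MAR mechanism (assumed): missingness in y depends only on the observed x
ys_missing = [None if rng.random() < (0.35 if x > 0 else 0.05) else y
              for x, y in zip(xs, ys_complete)]

filled = {m: impute(xs, ys_missing, m, rng)
          for m in ("mean", "regression", "stochastic")}

print("original SD:", round(statistics.stdev(ys_complete), 3))
for method, values in filled.items():
    print(method, "SD:", round(statistics.stdev(values), 3))
```

Mean substitution places every imputed case at the observed mean, so the standard deviation of the filled variable is necessarily smaller than that of the observed cases. Adding a residual draw to the regression prediction is one way to restore some of that lost variability.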