1 Department of Statistics, Government College University, Faisalabad, Pakistan.
2 Department of Statistics, Ludwig Maximilians University Munich, Munich, Germany.
Stat Methods Med Res. 2019 May;28(5):1311-1327. doi: 10.1177/0962280218755574. Epub 2018 Feb 16.
Missing data is a common issue that can cause problems for estimation and inference in biomedical, epidemiological and social research. Multiple imputation is an increasingly popular approach for handling missing data. When a large number of covariates have missing values, existing multiple imputation software packages may not work properly and often produce errors. We propose a multiple imputation algorithm called mispr, based on sequential penalized regression models. Each variable with missing values is assumed to have a different distributional form and is imputed with its own imputation model using the ridge penalty. When the number of predictors is large relative to the sample size, the quadratic penalty guarantees unique parameter estimates and leads to better predictions than the usual maximum likelihood estimation (MLE), with a good compromise between bias and variance. As a result, the proposed algorithm performs well and provides better imputed values even for a large number of covariates with small samples. The results are compared with the existing software packages mice, VIM and Amelia in simulation studies. The missing at random mechanism was the main assumption in the simulation studies. The imputation performance of the proposed algorithm is evaluated with the mean squared imputation error and the mean absolute imputation error. The mean squared error, the parameter estimates with their standard errors, and confidence intervals are also computed to compare performance in the regression context. The proposed algorithm is observed to be a good competitor to the existing algorithms, with smaller mean squared imputation error, mean absolute imputation error and mean squared error. Its performance becomes considerably better than that of the existing algorithms as the number of covariates increases, especially when the number of predictors is close to or even greater than the sample size. The performance of the proposed algorithm is also examined in simulations based on two real-life datasets.
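The idea of imputing each incomplete variable in turn with a ridge-penalized regression on all other variables can be illustrated with a short Python sketch. This is not the mispr implementation: it assumes every variable is continuous and modeled by a Gaussian linear model, uses a fixed penalty `lam` rather than a tuned one, and only adds residual noise to the predictions instead of also drawing the regression parameters. The function names `ridge_fit`, `sequential_ridge_impute` and `imputation_errors` are hypothetical, and the demo data use values missing completely at random purely for brevity.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge fit with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    # Xc'Xc + lam*I is positive definite for lam > 0, so the
    # solution is unique even when p >= n.
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return y_mean - x_mean @ beta, beta


def sequential_ridge_impute(data, lam=1.0, n_iter=10, seed=0):
    """Return one completed copy of `data` (NaN marks a missing value)."""
    rng = np.random.default_rng(seed)
    X = np.array(data, dtype=float)
    missing = np.isnan(X)
    # Start from column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        # Visit each incomplete variable in turn.
        for j in np.where(missing.any(axis=0))[0]:
            obs = ~missing[:, j]
            others = np.delete(X, j, axis=1)
            b0, beta = ridge_fit(others[obs], X[obs, j], lam)
            resid = X[obs, j] - (b0 + others[obs] @ beta)
            pred = b0 + others[~obs] @ beta
            # Add residual noise so that repeated runs differ.
            X[~obs, j] = pred + rng.normal(0.0, resid.std(), size=pred.shape)
    return X


def imputation_errors(true, imputed, missing):
    """Mean squared and mean absolute imputation error on missing cells."""
    diff = imputed[missing] - true[missing]
    return np.mean(diff ** 2), np.mean(np.abs(diff))


# Illustrative data with more predictors than observations (assumed setup).
demo_rng = np.random.default_rng(42)
true_values = demo_rng.normal(size=(30, 40))
mask = demo_rng.random(true_values.shape) < 0.1
incomplete = true_values.copy()
incomplete[mask] = np.nan

# m = 5 completed datasets, one per seed.
completed = [sequential_ridge_impute(incomplete, seed=s) for s in range(5)]
msie, maie = imputation_errors(true_values, completed[0], mask)
```

Each call with a different `seed` yields one completed dataset; analyses would be run on every completed dataset and then pooled, as in any multiple imputation workflow.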