Li Y M, Zhao P, Yang Y H, Wang J X, Yan H, Chen F Y
Department of Epidemiology and Biostatistics, School of Public Health of Xi'an Jiaotong University Health Science Center, Xi'an 710061, China.
Zhonghua Liu Xing Bing Xue Za Zhi. 2021 Oct 10;42(10):1889-1894. doi: 10.3760/cma.j.cn112338-20201130-01363.
Missing data are an unavoidable problem in cohort studies. This paper uses a simulation study to compare the performance of eight common imputation methods on truncated longitudinal data, in order to provide a reference for handling missing data in longitudinal studies. The simulation was carried out in R, and the missing longitudinal data were generated by the Monte Carlo method. The methods were evaluated by comparing the mean absolute deviation, the mean relative deviation, and the type I error of a regression analysis fitted after imputation, so that both the accuracy of the imputed values and the influence on subsequent multivariate analysis were assessed. Mean imputation, k-nearest neighbor (KNN), regression imputation, and random forest had similar imputation performance, and their performance was stable. Hot deck imputation was inferior to these four methods. K-means clustering and the expectation-maximization (EM) algorithm performed worst and were unstable. Mean imputation, the EM algorithm, random forest, KNN, and regression imputation controlled the type I error, whereas multiple imputation, hot deck, and K-means clustering did not control it effectively. For missing data in longitudinal studies, mean imputation, KNN, regression imputation, and random forest can be used as the better imputation methods under the missing at random mechanism. When the proportion of missing data is not too large, multiple imputation and hot deck can also perform well, but K-means clustering and the EM algorithm are not recommended.
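The simulation design described in the abstract (generate complete longitudinal data, delete values under a missing at random mechanism, impute, then compare the imputed values with the truth) can be illustrated with a minimal sketch. The study itself was run in R and its code is not reproduced here; the Python below is only an illustration under assumptions that are not taken from the paper: the data-generating model, the missingness rule, the parameter values, and all function names (`simulate`, `make_mar`, `mean_impute`, `regression_impute`, `deviations`, `run`) are hypothetical, and only two of the eight methods are shown.

```python
import random
import statistics


def simulate(n_subjects=200, n_visits=4, seed=0):
    """Complete longitudinal data: one list of visit values per subject (assumed model)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        level = rng.gauss(50.0, 5.0)  # subject-specific level
        data.append([level + 2.0 * t + rng.gauss(0.0, 2.0) for t in range(n_visits)])
    return data


def make_mar(data, rate=0.2, seed=1):
    """Delete follow-up values; the deletion probability depends only on the
    always-observed first visit, which makes the mechanism missing at random."""
    rng = random.Random(seed)
    cut = statistics.median(row[0] for row in data)
    out = []
    for row in data:
        p = rate * 1.5 if row[0] > cut else rate * 0.5
        out.append([row[0]] + [None if rng.random() < p else v for v in row[1:]])
    return out


def mean_impute(missing):
    """Replace each missing value with the observed mean of its visit."""
    n_visits = len(missing[0])
    means = [statistics.fmean(r[t] for r in missing if r[t] is not None)
             for t in range(n_visits)]
    return [[means[t] if v is None else v for t, v in enumerate(r)] for r in missing]


def regression_impute(missing):
    """Predict each missing follow-up value from the first visit by simple least squares."""
    filled = [list(r) for r in missing]
    for t in range(1, len(missing[0])):
        pairs = [(r[0], r[t]) for r in missing if r[t] is not None]
        mx = statistics.fmean(x for x, _ in pairs)
        my = statistics.fmean(y for _, y in pairs)
        sxx = sum((x - mx) ** 2 for x, _ in pairs)
        slope = sum((x - mx) * (y - my) for x, y in pairs) / sxx
        for r in filled:
            if r[t] is None:
                r[t] = my + slope * (r[0] - mx)
    return filled


def deviations(truth, missing, imputed):
    """Mean absolute and mean relative deviation over the cells that were deleted."""
    abs_dev, rel_dev = [], []
    for true_row, miss_row, imp_row in zip(truth, missing, imputed):
        for tv, mv, iv in zip(true_row, miss_row, imp_row):
            if mv is None:
                abs_dev.append(abs(iv - tv))
                rel_dev.append(abs(iv - tv) / abs(tv))
    return statistics.fmean(abs_dev), statistics.fmean(rel_dev)


def run(n_reps=50, rate=0.2):
    """Monte Carlo loop: average both deviations over repeated simulated datasets."""
    methods = {"mean": mean_impute, "regression": regression_impute}
    results = {name: [] for name in methods}
    for rep in range(n_reps):
        truth = simulate(seed=rep)
        missing = make_mar(truth, rate=rate, seed=1000 + rep)
        for name, impute in methods.items():
            results[name].append(deviations(truth, missing, impute(missing)))
    return {name: (statistics.fmean(a for a, _ in vals),
                   statistics.fmean(r for _, r in vals))
            for name, vals in results.items()}


if __name__ == "__main__":
    for name, (mad, mrd) in run().items():
        print(f"{name}: mean absolute deviation={mad:.3f}, mean relative deviation={mrd:.4f}")
```

The numbers this toy setup produces depend entirely on the assumed model and should not be read as a reproduction of the paper's results. Evaluating the type I error would additionally require fitting a regression to each imputed dataset generated under a true null effect and recording how often the null is rejected, which is omitted here for brevity.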