Department of Epidemiology, UT MD Anderson Cancer Center, Houston, TX 77030, USA.
J Immigr Minor Health. 2011 Dec;13(6):1099-109. doi: 10.1007/s10903-010-9415-8.
Missing data often occur in cross-sectional surveys and in longitudinal and experimental studies. The purpose of this study was to compare the prediction of self-rated health (SRH), a robust predictor of morbidity and mortality among diverse populations, before and after imputation of the missing variable "yearly household income." We reviewed data from 4,162 participants of Mexican origin who were recruited from July 1, 2002, through December 31, 2005, and enrolled in a population-based cohort study. Missing yearly income data were imputed using three different single imputation methods and one multiple imputation method under a Bayesian approach. Of the 4,162 participants, 3,121 were randomly assigned to a training set (to derive the yearly income imputation methods and develop the health-outcome prediction models) and 1,041 to a testing set (to compare the areas under the receiver-operating characteristic curve (AUC) of the resulting health-outcome prediction models). The discriminatory power of the SRH prediction models was good (range, 69-72%). Compared with the prediction model obtained without imputation of missing yearly income, all imputation methods improved the prediction of SRH (P < 0.05 for all comparisons), and the AUC for the model after multiple imputation was the highest (AUC = 0.731). Furthermore, when yearly income was imputed using multiple imputation, the odds of rating SRH as good or better increased by 11% for each $5,000 increment in yearly income. This study showed that although imputation of missing data for a key predictor variable can improve a health-outcome risk prediction model, further work is needed to illuminate the risk factors associated with SRH.
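The two summary statistics reported above, the AUC and the odds ratio per $5,000 of income, can be illustrated with a minimal sketch. It assumes the odds ratio comes from a logistic model (so log-odds are linear in income) and uses the rank-based definition of the AUC. The function names `scale_odds_ratio` and `auc`, and the toy scores, are hypothetical and are not taken from the study.

```python
import math


def scale_odds_ratio(odds_ratio, base_increment, new_increment):
    """Rescale an odds ratio reported per `base_increment` to `new_increment`.

    Assumes a logistic model: log-odds are linear in the predictor, so the
    odds ratio for k times the increment is the original ratio to the power k.
    """
    k = new_increment / base_increment
    return math.exp(math.log(odds_ratio) * k)


def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive case receives a
    higher predicted score than a randomly chosen negative case (ties count 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# An 11% increase in odds per $5,000 (odds ratio 1.11) implies
# 1.11 ** 2, about a 23% increase in odds, per $10,000.
or_per_10k = scale_odds_ratio(1.11, 5000, 10000)

# Hypothetical predicted probabilities for participants with good-or-better
# SRH (positives) and with poorer SRH (negatives); illustrative values only.
toy_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])
```

The pairwise loop is quadratic in the number of participants, which is fine for illustration; on a testing set of 1,041 participants a rank-based or library implementation would be the usual choice.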