Shrive Fiona M, Stuart Heather, Quan Hude, Ghali William A
Department of Community Health Sciences, Faculty of Medicine, University of Calgary, Alberta, Canada.
BMC Med Res Methodol. 2006 Dec 13;6:57. doi: 10.1186/1471-2288-6-57.
Missing data present a challenge to many research projects. The problem is often pronounced in studies utilizing self-report scales, and literature addressing different strategies for dealing with missing data in such circumstances is scarce. The objective of this study was to compare six different imputation techniques for dealing with missing data in the Zung Self-reported Depression scale (SDS).
A total of 1580 participants from a surgical outcomes study completed the SDS. The SDS is a 20-question scale that respondents complete by circling a value of 1 to 4 for each question. The responses are summed, and respondents are classified as exhibiting depressive symptoms when their total score is over 40. Missing values were simulated by randomly selecting questions whose values were then deleted (a missing completely at random simulation). Additionally, missing at random and missing not at random simulations were completed. Six imputation methods were then considered: 1) multiple imputation, 2) single regression, 3) individual mean, 4) overall mean, 5) participant's preceding response, and 6) random selection of a value from 1 to 4. For each method, the imputed mean SDS score and standard deviation were compared to the population statistics. The Spearman correlation coefficient, percent misclassified, and the Kappa statistic were also calculated.
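The missing completely at random simulation and the individual mean method described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, deleting each item independently with a fixed probability is an assumption about how questions were "randomly selected", and leaving the imputed mean unrounded is likewise an assumption.

```python
import random

CUTOFF = 40  # total SDS score above which a respondent is classified as depressed


def simulate_mcar(responses, fraction, rng):
    """Return a copy of `responses` with items deleted completely at random.

    Assumption: each item is deleted independently with probability `fraction`.
    Deleted items are represented as None.
    """
    return [[None if rng.random() < fraction else value for value in row]
            for row in responses]


def impute_individual_mean(row):
    """Replace each missing item with the mean of that respondent's observed items."""
    observed = [value for value in row if value is not None]
    if not observed:
        raise ValueError("cannot impute a respondent with no observed items")
    mean = sum(observed) / len(observed)
    return [mean if value is None else value for value in row]


def classify(row):
    """Return True when the summed score exceeds the cutoff of 40."""
    return sum(row) > CUTOFF


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic responses for illustration only, not study data.
    complete = [[rng.randint(1, 4) for _ in range(20)] for _ in range(5)]
    with_missing = simulate_mcar(complete, 0.10, rng)
    imputed = [impute_individual_mean(row) for row in with_missing]
    print([classify(row) for row in imputed])
```

Comparing `classify` on the complete rows with `classify` on the imputed rows gives the misclassification that the study summarizes as percent misclassified.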
When 10% of values were missing, all the imputation methods except random selection produced Kappa statistics greater than 0.80, indicating 'near perfect' agreement. Multiple imputation (MI) produced the most valid imputed values, with a high Kappa statistic (0.89), although both single regression and individual mean imputation also produced favorable results. As the percentage of missing information increased to 30%, or when unbalanced missing data were introduced, MI maintained a high Kappa statistic. The individual mean and single regression methods produced Kappas in the 'substantial agreement' range (0.76 and 0.74, respectively).
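For readers unfamiliar with the Kappa statistic, the sketch below shows how Cohen's kappa could be computed between the depressed/not-depressed classification from complete data and the classification after imputation. It uses the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e); the function name and the example inputs are hypothetical, and the abstract does not say which software the authors used.

```python
def cohen_kappa(true_labels, imputed_labels):
    """Cohen's kappa for two equal-length lists of boolean classifications."""
    n = len(true_labels)
    if n == 0 or n != len(imputed_labels):
        raise ValueError("inputs must be non-empty and of equal length")

    # Observed agreement: share of respondents classified identically.
    p_observed = sum(a == b for a, b in zip(true_labels, imputed_labels)) / n

    # Chance agreement from each classification's marginal rate of True.
    rate_true = sum(true_labels) / n
    rate_imputed = sum(imputed_labels) / n
    p_chance = rate_true * rate_imputed + (1 - rate_true) * (1 - rate_imputed)

    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


if __name__ == "__main__":
    # Illustrative labels only, not study data.
    original = [True, True, False, False]
    after_imputation = [True, False, False, False]
    print(cohen_kappa(original, after_imputation))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can fall well below the percent correctly classified when one class is rare.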
Multiple imputation is the most accurate method for dealing with missing data in most of the missing data scenarios we assessed for the SDS. Imputing the individual's mean is also an appropriate and simple method for dealing with missing data, and it may be more interpretable to the majority of medical readers. Researchers should consider conducting methodological assessments such as this one when confronted with missing data. The optimal method should balance validity, ease of interpretability for readers, and the analysis expertise of the research team.