Department of Epidemiology, Fielding School of Public Health, University of California, Los Angeles, Los Angeles, CA, USA.
Statistical Methods and Data Analytics, Office of Advanced Research Computing, University of California, Los Angeles, Los Angeles, CA, USA.
BMC Med Res Methodol. 2022 Apr 3;22(1):90. doi: 10.1186/s12874-022-01554-4.
Although standardized measures to assess substance use are available, most studies use variations of these measures, which makes it challenging to harmonize data across studies. The aim of this study was to evaluate the performance of different strategies for imputing missing substance use data that may arise from data harmonization procedures.
We used self-reported substance use data collected between August 2014 and June 2019 from 528 participants with 2,389 study visits in a cohort study of substance use and HIV. We selected one low-prevalence (heroin), one medium-prevalence (methamphetamine), and one high-prevalence (cannabis) drug and set 10-50% of the values for each substance to missing. The data amputation mimicked missingness that results from harmonization of disparate measures. We conducted Monte Carlo simulations to compare the performance of single and multiple imputation (MI) methods using the relative mean bias, root mean square error (RMSE), and coverage probability of the 95% confidence interval for each imputed estimate.
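The three performance measures named above can be illustrated with a minimal sketch. It is not the authors' code: the function names `performance_metrics` and `simulate_complete_case`, the sample size, and the Wald-type interval are all assumptions made for illustration. The toy simulation also deletes values completely at random, which is simpler than the harmonization-style amputation used in the study, so it will not reproduce the biases reported here.

```python
import math
import random


def performance_metrics(estimates, std_errors, true_value, z=1.96):
    """Summarize Monte Carlo replicates of an estimator against a known truth."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    # Relative mean bias, expressed as a percentage of the true value.
    rel_bias_pct = 100.0 * (mean_est - true_value) / true_value
    # Root mean square error across replicates.
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    # Share of replicates whose Wald-type 95% interval contains the true value.
    covered = sum(
        1
        for e, se in zip(estimates, std_errors)
        if e - z * se <= true_value <= e + z * se
    )
    return {"rel_bias_pct": rel_bias_pct, "rmse": rmse, "coverage": covered / n}


def simulate_complete_case(true_prev=0.3, n=500, missing_frac=0.1, reps=1000, seed=0):
    """Toy Monte Carlo: random deletion followed by a complete-case estimate.

    All defaults are hypothetical and are not taken from the study.
    """
    rng = random.Random(seed)
    estimates, std_errors = [], []
    for _ in range(reps):
        data = [1 if rng.random() < true_prev else 0 for _ in range(n)]
        # Assumed amputation: each value is dropped independently (MCAR).
        observed = [x for x in data if rng.random() >= missing_frac]
        if not observed:
            continue  # skip the (very unlikely) replicate with no observed data
        p_hat = sum(observed) / len(observed)
        estimates.append(p_hat)
        std_errors.append(math.sqrt(p_hat * (1 - p_hat) / len(observed)))
    return performance_metrics(estimates, std_errors, true_prev)


if __name__ == "__main__":
    print(simulate_complete_case())
```

To evaluate an imputation method instead of the complete-case estimate, the estimate and standard error inside the loop would be replaced by those obtained after imputing the deleted values; the summary step stays the same.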
Without imputation (i.e., listwise deletion), estimates of substance use were biased, especially for low-prevalence outcomes such as heroin. For instance, even when only 10% of data were missing, the complete case analysis underestimated the prevalence of heroin by 33%. MI, even with as few as five imputations, produced the least biased estimates. However, for a high-prevalence outcome such as cannabis with low to moderate missingness, the performance of single imputation strategies improved. For instance, for cannabis with 10% missingness, single imputation with regression performed as well as multiple imputation, with minimal bias (relative mean bias of 0.06% and 0.07%, respectively) and comparable performance (RMSE = 0.0102 for both; coverage of 95.8% and 96.2%, respectively).
Our results from imputing missing substance use data that arise from data harmonization indicate that MI provided the best performance across a range of conditions. Additionally, single imputation performed comparably in scenarios where the prevalence of the outcome was high and missingness was low. These findings provide a practical application for the evaluation of several imputation strategies and help to address the missing data problem when combining data from individual studies.