Faculty of Biology, Medicine and Health, Vaughan House, University of Manchester, Manchester, M13 9PL, UK.
BMC Med Res Methodol. 2020 Jul 8;20(1):185. doi: 10.1186/s12874-020-01068-x.
Within routinely collected health data, the fact that data are missing for an individual might itself provide useful information. This occurs, for example, in electronic health records, where the presence or absence of data is informative. While the naive use of missing indicators to try to exploit such information can introduce bias, their use in conjunction with multiple imputation may unlock the potential value of missingness to reduce bias in causal effect estimation, particularly in missing not at random scenarios and where missingness might be associated with unmeasured confounders.
We conducted a simulation study to determine when the use of a missing indicator, combined with multiple imputation, would reduce bias in causal effect estimation, under a range of scenarios including unmeasured variables and both missing not at random and missing at random mechanisms. We used directed acyclic graphs and structural models to elucidate a variety of causal structures of interest. We handled missing data using complete case analysis, and multiple imputation with and without missing indicator terms.
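The comparison of the three missing-data strategies can be illustrated with a minimal sketch of a single simulation replicate. Everything specific here is an assumption made for illustration and is not taken from the study: the data-generating model (one confounder `c`, exposure `x`, outcome `y`, true effect `beta = 1`), the logistic missingness mechanism that depends on `c` itself, the hypothetical function name `simulate_once`, and the choice to add the indicator `r` only to the outcome model. The imputation step is a simplified stochastic regression imputation that does not redraw the imputation-model parameters, so it is not a full implementation of multiple imputation.

```python
import numpy as np


def ols(design, y):
    """Least-squares coefficients for y regressed on the columns of design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def simulate_once(n=5000, beta=1.0, m=10, seed=0):
    """One replicate: estimate the effect of x on y when confounder c is MNAR.

    All structural choices below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Assumed structural model: c confounds the x -> y relationship.
    c = rng.normal(size=n)
    x = 0.5 * c + rng.normal(size=n)
    y = beta * x + c + rng.normal(size=n)

    # Missing not at random: the chance that c is missing depends on c itself.
    p_miss = 1.0 / (1.0 + np.exp(-c))
    r = (rng.uniform(size=n) < p_miss).astype(float)  # 1 = c is missing
    obs = r == 0
    ones = np.ones(n)

    # 1) Complete case analysis: outcome model on rows with c observed.
    cc = ols(np.column_stack([ones, x, c])[obs], y[obs])[1]

    # Imputation model c ~ x + y, fitted on the observed rows only.
    z = np.column_stack([ones, x, y])
    gamma = ols(z[obs], c[obs])
    resid_sd = np.std(c[obs] - z[obs] @ gamma, ddof=3)
    n_miss = int(r.sum())

    mi, mi_ind = [], []
    for _ in range(m):
        # Simplified imputation: predicted value plus residual noise.
        c_imp = c.copy()
        c_imp[~obs] = z[~obs] @ gamma + rng.normal(scale=resid_sd, size=n_miss)
        # 2) Outcome model on imputed data, without the missing indicator.
        mi.append(ols(np.column_stack([ones, x, c_imp]), y)[1])
        # 3) Outcome model on imputed data, with the missing indicator r.
        mi_ind.append(ols(np.column_stack([ones, x, c_imp, r]), y)[1])

    # Pool point estimates across imputations by averaging.
    return {
        "complete_case": float(cc),
        "mi": float(np.mean(mi)),
        "mi_indicator": float(np.mean(mi_ind)),
    }


if __name__ == "__main__":
    for name, estimate in simulate_once().items():
        print(f"{name}: estimate={estimate:.3f}, bias={estimate - 1.0:+.3f}")
```

Repeating `simulate_once` over many seeds and averaging `estimate - beta` for each method would give an empirical bias per strategy; a single replicate only shows the mechanics of the comparison.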
We found that multiple imputation combined with a missing indicator gives minimal bias for causal effect estimation in most scenarios. In particular, the approach: 1) does not introduce bias in missing (completely) at random scenarios; 2) reduces bias in missing not at random scenarios where the missing mechanism depends on the missing variable itself; and 3) may reduce or increase bias when unmeasured confounding is present.
In the presence of missing data, careful use of missing indicators, combined with multiple imputation, can improve causal effect estimation when missingness is informative, and is not detrimental when missingness is at random.