Alexander W. Levis, Rajarshi Mukherjee, Rui Wang, Heidi Fischer, Sebastien Haneuse
Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.
Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts.
Stat Med. 2024 Dec 30;43(30):6086-6098. doi: 10.1002/sim.10298. Epub 2024 Dec 5.
Missing data arise in most applied settings and are ubiquitous in electronic health records (EHR). When data are missing not at random (MNAR) with respect to measured covariates, sensitivity analyses are often considered. Such analyses, however, are often unsatisfying because they are not guaranteed to yield actionable conclusions. Motivated by an EHR-based study of long-term outcomes following bariatric surgery, we consider double sampling as a means to mitigate MNAR outcome data when the statistical goals are estimation and inference regarding causal effects. We describe assumptions that are sufficient to identify the joint distribution of confounders, treatment, and outcome under this design. Additionally, we derive efficient and robust estimators of the average causal treatment effect under a nonparametric model and under a model assuming outcomes were, in fact, initially missing at random (MAR). We compare these in simulations to an approach that adaptively estimates based on evidence of violation of the MAR assumption. Finally, we show that the proposed double sampling design can be extended to handle arbitrary coarsening mechanisms, and we derive nonparametric efficient estimators of any smooth full data functional.
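The idea of double sampling can be illustrated with a small simulation. The sketch below is not the efficient estimator derived in the paper; it is a plain inverse-probability-weighted (Hajek-type) estimator under simplifying assumptions introduced here for illustration: the treatment propensity `e` and the follow-up sampling probability `pi` are known, follow-up always recovers the outcome, and selection into follow-up depends only on the observed confounder. All names (`simulate`, `weighted_ate`) and the data-generating model are hypothetical.

```python
import math
import random


def simulate(n, seed=0):
    """Simulate (X, A, Y) with MNAR outcomes and a follow-up subsample.

    Hypothetical data-generating model; the true average treatment effect is 1.0.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                      # measured confounder
        e = 1.0 / (1.0 + math.exp(-0.5 * x))         # known P(A=1 | X)
        a = 1 if rng.random() < e else 0
        y = 1.0 * a + x + rng.gauss(0.0, 1.0)
        # Initial observation depends on Y itself, so the outcome is MNAR.
        p_obs = 1.0 / (1.0 + math.exp(-(0.5 - 0.8 * y)))
        r = 1 if rng.random() < p_obs else 0
        # Double sampling: follow up initial non-respondents with a known
        # probability that depends only on the observed confounder.
        pi = 0.3 if x > 0 else 0.5
        s = 1 if (r == 0 and rng.random() < pi) else 0
        y_obs = y if (r == 1 or s == 1) else None    # outcome hidden otherwise
        rows.append((a, y_obs, e, r, s, pi))
    return rows


def weighted_ate(rows, use_double_sample):
    """Hajek-type IPW estimate of the average treatment effect.

    With use_double_sample=False only initially observed outcomes are used.
    With use_double_sample=True, followed-up subjects are up-weighted by 1/pi.
    """
    num1 = den1 = num0 = den0 = 0.0
    for a, y_obs, e, r, s, pi in rows:
        if use_double_sample:
            w = 1.0 if r == 1 else s / pi
        else:
            w = float(r)
        if w == 0.0:
            continue                                 # outcome not used
        if a == 1:
            num1 += w / e * y_obs
            den1 += w / e
        else:
            num0 += w / (1.0 - e) * y_obs
            den0 += w / (1.0 - e)
    return num1 / den1 - num0 / den0


rows = simulate(100_000)
naive = weighted_ate(rows, use_double_sample=False)
dbl = weighted_ate(rows, use_double_sample=True)
print(f"complete-case IPW: {naive:.3f}")
print(f"double-sampling IPW: {dbl:.3f} (truth in this simulation: 1.0)")
```

In this toy setup the combined weight `r + (1 - r) * s / pi` has conditional mean one given the full data, which is why the second estimator targets the full-data effect even though initial missingness depends on the outcome.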