Guha Sharmistha, Reiter Jerome P
Department of Statistics, Texas A&M University, College Station, 77843, TX, USA.
Department of Statistical Science, Duke University, Durham, 27708, NC, USA.
J Stat Plan Inference. 2024 Mar;229. doi: 10.1016/j.jspi.2023.07.004. Epub 2023 Aug 1.
We consider causal inference for observational studies with data spread over two files. One file includes the treatment, outcome, and some covariates measured on a set of individuals, and the other file includes additional causally relevant covariates measured on a partially overlapping set of individuals. By linking records in the two databases, the analyst can control for more covariates, thereby reducing the risk of bias compared to using one file alone. When analysts do not have access to a unique identifier that enables perfect, error-free linkages, they typically rely on probabilistic record linkage to construct a single linked data set, and estimate causal effects using these linked data. This typical practice does not propagate uncertainty from imperfect linkages to the causal inferences. Further, it does not take advantage of relationships among the variables to improve the linkage quality. We address these shortcomings by fusing regression-assisted, Bayesian probabilistic record linkage with causal inference. The Markov chain Monte Carlo sampler generates multiple plausible linked data files as byproducts that analysts can use for multiple imputation inferences. Here, we show results for two causal estimators based on propensity score overlap weights. Using simulations and data from the Italy Survey on Household Income and Wealth, we show that our approach can improve the accuracy of estimated treatment effects.
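As a rough illustration of the workflow the abstract describes, the sketch below computes an overlap-weighted treatment effect on one linked data file and pools estimates from several plausible linked files with the standard multiple imputation combining rules. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`fit_logistic`, `overlap_weighted_effect`, `rubin_combine`) are hypothetical, the propensity model is assumed to be a plain logistic regression, the estimator shown is a normalized weighted difference in means (the abstract does not specify which two estimators the paper uses), and the within-file variances are assumed to be supplied by the analyst.

```python
import numpy as np


def fit_logistic(X, z, n_iter=25):
    """Fit a logistic regression by Newton-Raphson.

    X is an (n, p) design matrix that already contains an intercept
    column; z is a 0/1 treatment indicator. Assumes the data are not
    perfectly separable, so the Hessian stays invertible.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (z - p)                         # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def overlap_weighted_effect(y, z, e):
    """Normalized weighted difference in means with overlap weights.

    Treated units get weight 1 - e(x); control units get weight e(x),
    where e(x) is the estimated propensity score.
    """
    y, z, e = (np.asarray(a, dtype=float) for a in (y, z, e))
    w = np.where(z == 1, 1.0 - e, e)
    treated = z == 1
    mu1 = np.sum(w[treated] * y[treated]) / np.sum(w[treated])
    mu0 = np.sum(w[~treated] * y[~treated]) / np.sum(w[~treated])
    return mu1 - mu0


def rubin_combine(estimates, variances):
    """Pool M point estimates and their variances across linked files.

    Returns the pooled estimate and the total variance
    T = U_bar + (1 + 1/M) * B, where U_bar is the mean within-file
    variance and B is the between-file variance of the estimates.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    b = q.var(ddof=1)  # between-file variance
    return q.mean(), u.mean() + (1.0 + 1.0 / m) * b
```

In use, an analyst would loop over the linked files saved from the sampler, fit the propensity model and compute `overlap_weighted_effect` on each, obtain a variance for each estimate (for example by bootstrap, which is an assumption here), and pass the two lists to `rubin_combine`. The between-file term is what carries the linkage uncertainty into the final interval.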