Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA.
Duke Global Health Institute, Duke University, Durham, NC, USA.
Stat Methods Med Res. 2020 Dec;29(12):3721-3756. doi: 10.1177/0962280220940334. Epub 2020 Jul 21.
Propensity score weighting methods are often used in non-randomized studies to adjust for confounding and assess treatment effects. The most popular among them, inverse probability weighting, assigns weights proportional to the inverse of the conditional probability of a specific treatment assignment, given observed covariates. A key requirement for inverse probability weighting estimation is the positivity assumption, i.e., the propensity score must be bounded away from 0 and 1. In practice, violations of the positivity assumption often manifest as limited overlap in the propensity score distributions between treatment groups. When these practical violations occur, a small number of highly influential inverse probability weights may lead to unstable inverse probability weighting estimators, with biased estimates and large variances. To mitigate these issues, a number of alternative methods have been proposed, including inverse probability weighting trimming, overlap weights, matching weights, and entropy weights. Because overlap weights, matching weights, and entropy weights target the population for whom there is equipoise (and hence adequate overlap), and because their estimands depend on the true propensity score, a common criticism is that these estimators may be more sensitive to misspecification of the propensity score model. In this paper, we conduct extensive simulation studies to compare the performance of inverse probability weighting and inverse probability weighting trimming against that of overlap weights, matching weights, and entropy weights under limited overlap and misspecified propensity score models. Across the wide range of scenarios we considered, overlap weights, matching weights, and entropy weights consistently outperform inverse probability weighting in terms of bias, root mean squared error, and coverage probability.
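The weighting schemes compared in the abstract can all be written in a common "tilting function" form: each assigns the weight h(x) / P(Z = z | X = x) for a different choice of h(x) (h = 1 for inverse probability weighting, h = e(1 - e) for overlap weights, h = min(e, 1 - e) for matching weights, and h equal to the binary entropy of e for entropy weights). The following is a minimal NumPy sketch of this idea on a simulated dataset with a homogeneous treatment effect and a known true propensity score; the simulation setup is illustrative and is not taken from the paper's simulation studies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Illustrative data-generating process (not from the paper):
# one confounder x, true propensity e(x), homogeneous effect of 2.
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-1.5 * x))      # true propensity score e(x)
z = rng.binomial(1, e)                   # treatment assignment
y = 2.0 * z + x + rng.normal(size=n)     # outcome; true effect = 2

# P(Z = z | X = x) for the observed assignment of each unit.
denom = z * e + (1 - z) * (1 - e)

def tilted_estimate(h):
    """Hajek-style weighted difference in means for tilting function h(x)."""
    w = h / denom
    w1, w0 = w * z, w * (1 - z)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

estimates = {
    "IPW":      tilted_estimate(np.ones(n)),                 # h(x) = 1
    "overlap":  tilted_estimate(e * (1 - e)),                # h(x) = e(1 - e)
    "matching": tilted_estimate(np.minimum(e, 1 - e)),       # h(x) = min(e, 1 - e)
    "entropy":  tilted_estimate(-(e * np.log(e)
                                  + (1 - e) * np.log1p(-e))),  # binary entropy
}
print(estimates)
```

Because the treatment effect here is homogeneous, all four estimands coincide and every estimate should be close to 2; the schemes differ mainly in how they down-weight units with propensity scores near 0 or 1, which is what drives the stability differences the abstract describes under limited overlap.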