Institute for Clinical Evaluative Sciences, Toronto, ON, Canada.
Stat Med. 2010 Sep 10;29(20):2137-48. doi: 10.1002/sim.3854.
Propensity score methods are increasingly being used to estimate the effects of treatments on health outcomes using observational data. There are four methods for using the propensity score to estimate treatment effects: covariate adjustment using the propensity score, stratification on the propensity score, propensity-score matching, and inverse probability of treatment weighting (IPTW) using the propensity score. When outcomes are binary, the effect of treatment on the outcome can be described using odds ratios, relative risks, risk differences, or the number needed to treat. Several clinical commentators suggested that risk differences and numbers needed to treat are more meaningful for clinical decision making than are odds ratios or relative risks. However, there is a paucity of information about the relative performance of the different propensity-score methods for estimating risk differences. We conducted a series of Monte Carlo simulations to examine this issue. We examined bias, variance estimation, coverage of confidence intervals, mean-squared error (MSE), and type I error rates. A doubly robust version of IPTW had superior performance compared with the other propensity-score methods. It resulted in unbiased estimation of risk differences, treatment effects with the lowest standard errors, confidence intervals with the correct coverage rates, and correct type I error rates. Stratification, matching on the propensity score, and covariate adjustment using the propensity score resulted in minor to modest bias in estimating risk differences. Estimators based on IPTW had lower MSE compared with other propensity-score methods. Differences between IPTW and propensity-score matching may reflect that these two methods estimate the average treatment effect and the average treatment effect for the treated, respectively.
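To make the comparison concrete, the following is a minimal sketch of how a risk difference might be estimated with IPTW and with a doubly robust estimator on simulated data. It is an illustration under stated assumptions, not the simulation design or the estimator used in the study: the data-generating coefficients, the function names (`simulate`, `fit_logistic`, `estimate_risk_differences`), and the choice of the augmented inverse-probability-weighted form as the doubly robust estimator are all hypothetical choices made for this example.

```python
import numpy as np


def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-v))


def fit_logistic(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def simulate(n, seed):
    """Simulate confounded binary treatment z and binary outcome y.

    All coefficients are illustrative assumptions, not values from the study.
    """
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    z = rng.binomial(1, expit(X @ np.array([-0.5, 0.8, 0.5])))
    p0 = expit(X @ np.array([-1.0, 0.6, 0.6]))  # outcome risk if untreated
    p1 = expit(X @ np.array([-0.3, 0.6, 0.6]))  # outcome risk if treated
    y = rng.binomial(1, np.where(z == 1, p1, p0))
    true_rd = float(np.mean(p1 - p0))  # average treatment effect in this sample
    return X, z, y, true_rd


def estimate_risk_differences(X, z, y):
    """Return crude, IPTW, and doubly robust (AIPW) risk-difference estimates."""
    # Propensity score: estimated probability of treatment given covariates.
    e = expit(X @ fit_logistic(X, z))

    # Crude difference in proportions, ignoring confounding.
    crude = y[z == 1].mean() - y[z == 0].mean()

    # IPTW with normalized weights: weighted risk in each arm.
    w1, w0 = z / e, (1 - z) / (1 - e)
    iptw = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

    # Outcome models fitted separately in each arm, then predicted for everyone.
    m1 = expit(X @ fit_logistic(X[z == 1], y[z == 1]))
    m0 = expit(X @ fit_logistic(X[z == 0], y[z == 0]))

    # Augmented IPW: outcome-model prediction plus a weighted residual correction.
    mu1 = np.mean(m1 + z * (y - m1) / e)
    mu0 = np.mean(m0 + (1 - z) * (y - m0) / (1 - e))
    return {"crude": float(crude), "iptw": float(iptw), "dr": float(mu1 - mu0)}


X, z, y, true_rd = simulate(20000, seed=1)
estimates = estimate_risk_differences(X, z, y)
print("true risk difference:", round(true_rd, 4))
for name, value in estimates.items():
    print(name, round(value, 4))
```

Both weighted estimators in this sketch target the average treatment effect in the whole sample. A matched analysis would instead target the average treatment effect for the treated, so its estimate need not agree with these even when all models are correctly specified.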