Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH, USA.
PeerJ. 2013 Aug 1;1:e123. doi: 10.7717/peerj.123. Print 2013.
Background. Propensity scores appear to be growing in popularity, leading researchers to ask whether they have a role in prediction modeling, despite the lack of a theoretical rationale. Such requests likely stem from a failure to distinguish the goals of predictive modeling from those of causal inference modeling. The purpose of this study is therefore to formally examine the effect of propensity scores on predictive performance. Our hypothesis is that a multivariable regression model that adjusts for all covariates will perform as well as or better than models that use propensity scores, with respect to model discrimination and calibration.

Methods. The most commonly encountered statistical scenarios for medical prediction (logistic and proportional hazards regression) were used to investigate this research question. Random cross-validation was performed 500 times to correct for optimism. Multivariable regression models adjusting for all covariates were compared with models that included adjustment for, or weighting with, the propensity scores. The methods were compared on three predictive performance measures: (1) concordance indices; (2) Brier scores; and (3) calibration curves.

Results. Multivariable models adjusting for all covariates had the highest average concordance index, the lowest average Brier score, and the best calibration. Propensity score adjustment and inverse probability weighting models without adjustment for all covariates performed worse than the full models, and adding propensity score methods to models with full covariate adjustment failed to improve predictive performance.

Conclusion. Propensity score techniques did not improve prediction performance measures beyond multivariable adjustment. Propensity scores are not recommended if the analytical goal is pure prediction modeling.
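Two of the performance measures named in the Methods, the concordance index and the Brier score, can be illustrated for a binary outcome with a minimal sketch. This is not the study's code: the function names `brier_score` and `concordance_index` are hypothetical, the outcome and prediction vectors are made up for illustration, and the labels `full_model` and `ps_model` stand in for two sets of predicted probabilities rather than fitted models. The concordance index here is the usual pairwise definition for binary outcomes, with tied predictions counted as one half.

```python
from itertools import combinations


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome.

    Lower is better; 0 means perfect predictions.
    """
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def concordance_index(y_true, y_prob):
    """Share of event/non-event pairs in which the event has the higher prediction.

    Tied predictions count as 0.5. Assumes y_true contains at least one
    event (1) and one non-event (0); otherwise there are no usable pairs.
    """
    concordant = 0.0
    usable_pairs = 0
    for (y1, p1), (y2, p2) in combinations(zip(y_true, y_prob), 2):
        if y1 == y2:
            continue  # only pairs with different outcomes are informative
        usable_pairs += 1
        p_event, p_nonevent = (p1, p2) if y1 == 1 else (p2, p1)
        if p_event > p_nonevent:
            concordant += 1.0
        elif p_event == p_nonevent:
            concordant += 0.5
    return concordant / usable_pairs


# Illustrative (made-up) outcomes and predicted probabilities.
outcomes = [1, 0, 1, 0, 1, 0]
full_model = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1]  # hypothetical full-covariate predictions
ps_model = [0.6, 0.5, 0.4, 0.5, 0.7, 0.3]    # hypothetical propensity-score-only predictions

for name, preds in [("full", full_model), ("ps", ps_model)]:
    print(name, concordance_index(outcomes, preds), brier_score(outcomes, preds))
```

In a comparison like the one described in the abstract, both measures would be computed on held-out data in each cross-validation repetition and then averaged, so that the reported values are corrected for optimism.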