Epidemiology, Biostatistics and Prevention Institute, Department of Biostatistics, University of Zurich, Zurich, Switzerland.
Institute of Intensive Care Medicine, University Hospital Zurich, Zurich, Switzerland.
Biom J. 2024 Jan;66(1):e2100292. doi: 10.1002/bimj.202100292. Epub 2022 Apr 6.
Propensity score matching is increasingly used in the medical literature, yet the choice of matching algorithm, reporting quality, and estimands are often not discussed. We evaluated the impact of propensity score matching algorithms, based on a recent clinical dataset, with three commonly used outcomes. The resulting estimands for different strengths of treatment effects were compared in a neutral comparison study and on the basis of a thoroughly designed simulation study. Different algorithms yielded different levels of balance after matching. Good balance was achieved with full matching, genetic matching with replacement, and nearest neighbor matching with caliper, although the latter discarded more than one fifth of the treated units. Average marginal treatment effect estimates were least biased with genetic or nearest neighbor matching, both with replacement, and with full matching. Throughout, double adjustment yielded conditional treatment effects that were closer to the true values. The choice of matching algorithm affected both covariate balance after matching and the treatment effect estimates. In comparison, genetic matching with replacement yielded better covariate balance than all other matching algorithms. A literature review of the British Medical Journal and its subjournals revealed frequent use of propensity score matching; however, the use of different matching algorithms before treatment effect estimation was reported in only one of 21 studies. Propensity score matching is a methodology for causal treatment effect estimation from observational data; however, the methodological difficulties and low reporting quality in applied medical research need to be addressed.
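To make the trade-off between balance and discarded treated units concrete, the following is a minimal sketch of greedy 1:1 nearest neighbor matching with a caliper, without replacement. It is not the procedure used in the study: the simulated data, the single covariate, the use of the true propensity score instead of an estimated one, and the caliper of 0.2 standard deviations of the logit of the propensity score (a common convention) are all assumptions made for illustration, as are the function names.

```python
import math
import random
import statistics


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def nearest_neighbor_caliper(ps_treated, ps_control, caliper):
    """Greedy 1:1 matching on the logit propensity score, without replacement.

    Returns (treated_index, control_index) pairs. A treated unit whose
    closest remaining control lies outside the caliper is discarded.
    """
    logit_t = [logit(p) for p in ps_treated]
    logit_c = [logit(p) for p in ps_control]
    available = set(range(len(logit_c)))
    pairs = []
    for i, score in enumerate(logit_t):
        if not available:
            break
        # Closest control that has not been used yet.
        j = min(available, key=lambda k: abs(score - logit_c[k]))
        if abs(score - logit_c[j]) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


def smd(a, b):
    """Standardized mean difference with a pooled standard deviation."""
    pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2.0)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


# Hypothetical data: one confounder x that drives treatment assignment.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ps = [1.0 / (1.0 + math.exp(-(xi - 0.5))) for xi in x]  # true score (assumption)
treated = [rng.random() < p for p in ps]

x_t = [xi for xi, t in zip(x, treated) if t]
x_c = [xi for xi, t in zip(x, treated) if not t]
ps_t = [p for p, t in zip(ps, treated) if t]
ps_c = [p for p, t in zip(ps, treated) if not t]

# Caliper: 0.2 SD of the logit propensity score (assumed convention).
caliper = 0.2 * statistics.stdev([logit(p) for p in ps])
pairs = nearest_neighbor_caliper(ps_t, ps_c, caliper)

n_discarded = len(x_t) - len(pairs)
smd_before = smd(x_t, x_c)
smd_after = smd([x_t[i] for i, _ in pairs], [x_c[j] for _, j in pairs])

print(f"treated: {len(x_t)}, matched: {len(pairs)}, discarded: {n_discarded}")
print(f"SMD before: {smd_before:.3f}, after: {smd_after:.3f}")
```

Narrowing the caliper in this sketch tends to improve balance while discarding more treated units, which shifts the estimand away from the effect in all treated units. Matching with replacement or full matching, as compared in the abstract, would avoid discarding treated units but requires weights in the outcome analysis.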