Hripcsak George, Zhang Linying, Chen Yong, Li Kelly, Suchard Marc A, Ryan Patrick B, Schuemie Martijn J
Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA.
Observational Health Data Sciences and Informatics, New York, NY, USA.
medRxiv. 2025 Feb 21:2024.04.23.24306230. doi: 10.1101/2024.04.23.24306230.
Propensity score adjustment addresses confounding by balancing covariates across treatment groups through matching, stratification, or weighting, and diagnostics test whether the adjustment succeeded. For example, if the standardized mean difference (SMD) for a relevant covariate exceeds a threshold such as 0.1, the covariate is considered imbalanced and the study may be judged invalid. Unfortunately, for studies with small or moderate numbers of subjects, the probability of falsely rejecting the validity of a study because of chance imbalance (that is, of asserting imbalance via an SMD cutoff when no underlying imbalance exists) can be grossly larger than a nominal level such as 0.05. In this paper, we illustrate that chance imbalance is operative in real-world settings even at moderate sample sizes around 2000. We identify a previously unrecognized challenge: as meta-analysis increases the precision of an effect estimate, the diagnostics must also undergo meta-analysis for a corresponding increase in precision. We propose an alternative diagnostic that tests whether the standardized mean difference statistically significantly exceeds the threshold. Through simulation and real-world data, we find that this diagnostic achieves a better trade-off between type I error rate and power than either the standard nominal-threshold test or forgoing the test, for sample sizes from 250 to 4000 and for 20 to 100,000 covariates. We confirm that in network studies, meta-analysis of effect estimates must be accompanied by meta-analysis of the diagnostics, or else systematic confounding may overwhelm the estimated effect. Our procedure supports the review of large numbers of covariates, enabling more rigorous diagnostics.