Gruber Susan, Tchetgen Tchetgen Eric
Reagan-Udall Foundation for the Food and Drug Administration, Washington, DC, U.S.A.
Departments of Biostatistics and Epidemiology, Harvard School of Public Health, Boston, MA, U.S.A.
Stat Med. 2016 Sep 30;35(22):3869-82. doi: 10.1002/sim.6936. Epub 2016 Mar 10.
Controversy over non-reproducible published research reporting a statistically significant result has produced substantial discussion in the literature. p-value calibration is a recently proposed procedure that adjusts p-values to account for both random and systematic errors, and so addresses one aspect of this problem. The method's validity rests on the key assumption that bias in an effect estimate is drawn from a normal distribution whose mean and variance can be correctly estimated. We investigated the method's control of type I and type II error rates using simulated and real-world data. Under mild violations of the underlying assumptions, control of the type I error rate can be conservative, while under more extreme departures, it can be anti-conservative. The extent to which the assumption is violated in real-world data analyses is unknown. Barriers to testing the plausibility of the assumption using historical data are discussed. Our studies of the type II error rate using simulated and real-world electronic health care data demonstrated that calibrating p-values can substantially increase the type II error rate. The use of calibrated p-values may reduce the number of false-positive results, but there will be a commensurate drop in the ability to detect a true safety or efficacy signal. While p-value calibration can sometimes offer advantages in controlling the type I error rate, its adoption for routine use in studies of real-world health care datasets is premature. Separate characterizations of random and systematic errors provide a richer context for evaluating uncertainty surrounding effect estimates. Copyright © 2016 John Wiley & Sons, Ltd.
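The abstract does not state the calibration formula, so the following is only a minimal sketch of the idea under a common formulation: the effect estimate is on the log scale, and the systematic error is assumed to be normal with mean `bias_mean` and standard deviation `bias_sd` (in practice these would have to be estimated, for example from historical data). The function names and all numeric values are hypothetical illustrations, not taken from the paper.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2.0))


def traditional_p(log_rr, se):
    """p-value that accounts for random error only."""
    return two_sided_p(log_rr / se)


def calibrated_p(log_rr, se, bias_mean, bias_sd):
    """p-value under an ASSUMED normal bias distribution N(bias_mean, bias_sd**2).

    The null distribution of the estimate is shifted by the mean bias and
    widened by the bias variance, so the estimate is compared against
    N(bias_mean, se**2 + bias_sd**2) instead of N(0, se**2).
    """
    total_sd = math.sqrt(se ** 2 + bias_sd ** 2)
    return two_sided_p((log_rr - bias_mean) / total_sd)


# Illustrative (made-up) numbers: relative risk 1.5 with standard error 0.15.
log_rr = math.log(1.5)
se = 0.15
p_trad = traditional_p(log_rr, se)
p_cal = calibrated_p(log_rr, se, bias_mean=0.1, bias_sd=0.2)
print(f"traditional p = {p_trad:.4f}, calibrated p = {p_cal:.4f}")
```

With these made-up inputs the calibrated p-value is larger than the traditional one, which illustrates the trade-off the abstract describes: fewer nominally significant findings, and so fewer false positives, but also less power to detect a true effect. If the assumed bias distribution is misspecified, the calibrated value inherits that error.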