Recurring controversies about P values and confidence intervals revisited.

Publication information

Ecology. 2014 Mar;95(3):645-51. doi: 10.1890/13-1291.1.

Abstract

The paper focused primarily on certain charges, claims, and interpretations of the P value as they relate to CIs and the AIC. It is argued that some of these comparisons and claims are misleading because they ignore key differences in the procedures being compared, such as (1) their primary aims and objectives, (2) the nature of the question posed to the data, as well as (3) the nature of their underlying reasoning and the ensuing inferences. In the case of the P value, the crucial issue is whether Fisher's evidential interpretation of the P value as "indicating the strength of evidence against H0" is appropriate. It is argued that, despite Fisher's maligning of the Type II error, a principled way to provide an adequate evidential account, in the form of post-data severity evaluation, calls for taking into account the power of the test. The error-statistical perspective brings out a key weakness of the P value and addresses several foundational issues raised in frequentist testing, including the fallacies of acceptance and rejection as well as misinterpretations of observed CIs: see Mayo-Spanos (2011). The paper also uncovers the connection between model selection procedures and hypothesis testing, revealing the inherent unreliability of the former. Hence, the choice between different procedures should not be "stylistic" (Murtaugh 2013), but should depend on the questions of interest, the answers sought, and the reliability of the procedures.
