Max Planck Society, Munich, Germany.
PLoS One. 2010 Dec 14;5(12):e14331. doi: 10.1371/journal.pone.0014331.
This paper presents the first meta-analysis of the inter-rater reliability (IRR) of journal peer reviews. IRR is defined as the extent to which two or more independent reviews of the same scientific document agree.
METHODOLOGY/PRINCIPAL FINDINGS: Altogether, 70 reliability coefficients (Cohen's Kappa, intra-class correlation [ICC], and Pearson product-moment correlation [r]) from 48 studies were included in the meta-analysis. The studies covered a total of 19,443 manuscripts; on average, each study had a sample size of 311 manuscripts (minimum: 28, maximum: 1,983). The results of the meta-analysis confirmed the findings of the narrative literature reviews published to date: the level of IRR (mean ICC/r² = .34, mean Cohen's Kappa = .17) was low. To explain the study-to-study variation of the IRR coefficients, meta-regression analyses were conducted using seven covariates. Two covariates emerged as statistically significant in achieving approximate homogeneity of the intra-class correlations: first, the more manuscripts a study was based on, the smaller the reported IRR coefficients were; second, studies that reported information on the rating system used by reviewers showed smaller IRR coefficients than studies that did not.
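To illustrate one of the agreement statistics pooled in the meta-analysis, the following is a minimal sketch (not the authors' code) of Cohen's Kappa for two reviewers' categorical recommendations; the rating categories and data are invented for illustration:

```python
from collections import Counter

def cohens_kappa(ratings1, ratings2):
    """Cohen's Kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e the agreement expected if the
    two raters assigned categories independently at their own rates.
    """
    assert len(ratings1) == len(ratings2), "raters must judge the same items"
    n = len(ratings1)
    # Observed agreement: fraction of items with identical ratings.
    p_o = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    # Expected agreement: product of each rater's marginal frequencies.
    c1, c2 = Counter(ratings1), Counter(ratings2)
    p_e = sum(c1[cat] * c2[cat] for cat in set(c1) | set(c2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical recommendations of two reviewers on six manuscripts.
reviewer_a = ["accept", "reject", "accept", "revise", "reject", "accept"]
reviewer_b = ["accept", "revise", "accept", "revise", "reject", "reject"]
kappa = cohens_kappa(reviewer_a, reviewer_b)  # 0.5 for this toy data
```

A Kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the pooled mean of .17 reported above corresponds to only slight agreement among reviewers.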
CONCLUSIONS/SIGNIFICANCE: Studies that report a high level of IRR should therefore be regarded as less credible than those reporting a low level of IRR. According to our meta-analysis, the IRR of peer assessments is quite limited and needs improvement (e.g., through a reader system).