Rubin H R, Redelmeier D A, Wu A W, Steinberg E P
Division of Internal Medicine, Johns Hopkins University, Baltimore, Maryland 21205.
J Gen Intern Med. 1993 May;8(5):255-8. doi: 10.1007/BF02600092.
To evaluate the interrater reproducibility of scientific abstract review.
Retrospective analysis.
Review for the 1991 Society of General Internal Medicine (SGIM) annual meeting.
426 abstracts in seven topic categories evaluated by 55 reviewers.
Reviewers rated abstracts from 1 (poor) to 5 (excellent), globally and on three specific dimensions: interest to the SGIM audience, quality of methods, and quality of presentation. Each abstract was reviewed by five to seven reviewers. Each reviewer's ratings of the three dimensions were added to compute that reviewer's summary score for a given abstract. The mean of all reviewers' summary scores for an abstract, the final score, was used by SGIM to select abstracts for the meeting.
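The scoring rule described above can be illustrated with a minimal sketch. The function names and the example ratings are hypothetical and are not taken from the study; only the rule itself (sum the three dimension ratings per reviewer, then average across reviewers) comes from the text.

```python
from statistics import mean

def summary_score(interest, methods, presentation):
    """One reviewer's summary score: the sum of the three 1-5 dimension ratings."""
    for rating in (interest, methods, presentation):
        if not 1 <= rating <= 5:
            raise ValueError("each rating must be between 1 and 5")
    return interest + methods + presentation  # possible range: 3 to 15

def final_score(reviews):
    """Final score for an abstract: the mean of all reviewers' summary scores."""
    return mean(summary_score(*review) for review in reviews)

# Hypothetical ratings from five reviewers: (interest, methods, presentation).
example_reviews = [(4, 3, 4), (3, 3, 3), (5, 4, 4), (2, 3, 3), (4, 4, 3)]
print(final_score(example_reviews))
```

Under this rule the global rating plays no part in the final score; it is recorded separately.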
Final scores ranged from 4.6 to 13.6 (mean = 9.9). Although 222 abstracts (52%) were accepted for publication, the 95% confidence interval around the final score of 300 (70.4%) of the 426 abstracts overlapped with the threshold for acceptance of an abstract. Thus, these abstracts were potentially misclassified. Only 36% of the variance in summary scores was associated with an abstract's identity and 12% with the reviewer's identity; the remaining 52% was associated with idiosyncratic reviews of abstracts. Global ratings were more reproducible than summary scores.
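The misclassification check can be sketched as follows. The abstract does not state how the confidence intervals were computed, so this sketch assumes a standard t-based interval around the mean of the reviewers' summary scores. The cutoff value and the example scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by number of reviewers
# (degrees of freedom = reviewers - 1). Covers the five to seven reviewers per abstract.
T_CRIT_95 = {5: 2.776, 6: 2.571, 7: 2.447}

def ci95(summary_scores):
    """Assumed method: t-based 95% confidence interval around the final score."""
    n = len(summary_scores)
    centre = mean(summary_scores)
    half_width = T_CRIT_95[n] * stdev(summary_scores) / sqrt(n)
    return centre - half_width, centre + half_width

def potentially_misclassified(summary_scores, threshold):
    """True when the interval overlaps the acceptance threshold."""
    low, high = ci95(summary_scores)
    return low <= threshold <= high

HYPOTHETICAL_THRESHOLD = 10.0  # illustrative cutoff, not reported in the abstract

print(potentially_misclassified([11, 9, 13, 8, 11], HYPOTHETICAL_THRESHOLD))
print(potentially_misclassified([14, 14, 15, 14, 14], HYPOTHETICAL_THRESHOLD))
```

With only five to seven reviewers per abstract, even moderate disagreement produces an interval several points wide. That is consistent with the large share of abstracts whose intervals overlapped the threshold.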
Reviewers disagreed substantially when evaluating the same abstracts. Future meeting organizers may wish to rank abstracts using global ratings, and to experiment with structured review criteria and other ways to improve raters' agreement.