Dawid Pieper, Anja Jacobs, Beate Weikert, Alba Fishta, Uta Wegewitz
Institute for Research in Operative Medicine, Witten/Herdecke University, Ostmerheimer Str. 200 (Building 38), 51109, Cologne, Germany.
The Federal Joint Committee (G-BA), Wegelystr. 8, 10623, Berlin, Germany.
BMC Med Res Methodol. 2017 Jul 11;17(1):98. doi: 10.1186/s12874-017-0380-y.
Inter-rater reliability (IRR) is typically assessed by only two reviewers of unknown expertise. The aim of this paper is to examine how the IRR of the Assessment of Multiple Systematic Reviews (AMSTAR) and R(evised)-AMSTAR differs depending on the pair of reviewers.
Five reviewers independently applied AMSTAR and R-AMSTAR to 16 systematic reviews (eight Cochrane reviews and eight non-Cochrane reviews) from the field of occupational health. Responses were dichotomized and reliability measures were calculated by applying Holsti's method (r) and Cohen's kappa (κ) to all potential pairs of reviewers. Given that five reviewers participated in the study, there were ten possible pairs of reviewers.
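The pairwise reliability calculation described above can be sketched as follows. This is an illustrative pure-Python sketch, not the authors' code; the reviewer names and 0/1 ratings are hypothetical stand-ins for the dichotomized AMSTAR responses.

```python
# Illustrative sketch (not the authors' code): pairwise inter-rater
# reliability for dichotomized ratings. Each reviewer's responses are a
# list of 0/1 values, one per assessed item.
from itertools import combinations

def holsti(a, b):
    """Holsti's method, r = 2A / (n1 + n2); with both coders rating the
    same n items this reduces to the proportion of agreements A / n."""
    agreements = sum(x == y for x, y in zip(a, b))
    return agreements / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters: chance-corrected agreement."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    p1a, p1b = sum(a) / n, sum(b) / n               # marginal P(rating = 1)
    pe = p1a * p1b + (1 - p1a) * (1 - p1b)          # agreement expected by chance
    return (po - pe) / (1 - pe)

# Five hypothetical reviewers rating ten items (made-up 0/1 data; the
# study itself used 16 reviews rated on the AMSTAR/R-AMSTAR items).
ratings = {
    "R1": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    "R2": [1, 0, 1, 0, 0, 1, 0, 1, 1, 1],
    "R3": [1, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    "R4": [0, 0, 1, 1, 0, 1, 1, 1, 1, 0],
    "R5": [1, 0, 0, 1, 1, 1, 0, 1, 1, 0],
}

pairs = list(combinations(ratings, 2))   # 5 reviewers -> C(5, 2) = 10 pairs
for r1, r2 in pairs:
    r = holsti(ratings[r1], ratings[r2])
    k = cohens_kappa(ratings[r1], ratings[r2])
    print(f"{r1}-{r2}: Holsti r = {r:.2f}, kappa = {k:.2f}")
```

With five reviewers the enumeration yields exactly the ten pairs mentioned above; both statistics are then summarized per pair (range and median) as reported in the results.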
Depending on the pair of reviewers, inter-rater reliability for AMSTAR varied between r = 0.82 and r = 0.98 (median r = 0.88) using Holsti's method and between κ = 0.41 and κ = 0.69 (median κ = 0.52) using Cohen's kappa; for R-AMSTAR it varied between r = 0.77 and r = 0.89 (median r = 0.82) and between κ = 0.32 and κ = 0.67 (median κ = 0.45). The same pair of reviewers yielded the highest IRR for both instruments. Pairwise Cohen's kappa reliability measures showed a moderate correlation between AMSTAR and R-AMSTAR (Spearman's ρ = 0.50). The mean inter-rater reliability for AMSTAR was highest for item 1 (κ = 1.00) and item 5 (κ = 0.78), while the lowest values were found for items 3, 8, 9 and 11, which showed only fair agreement.
Inter-rater reliability varies widely depending on the pair of reviewers. There may be some shortcomings associated with conducting reliability studies with only two reviewers. Further studies should include additional reviewers and should probably also take account of their level of expertise.