George & Fay Yee Center for Healthcare Innovation, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, Manitoba R3E 0T6, Canada; Department of Community Health Sciences, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, Manitoba R3E 0T6, Canada.
J Clin Epidemiol. 2020 Dec;128:140-147. doi: 10.1016/j.jclinepi.2020.09.033. Epub 2020 Sep 25.
To assess the real-world interrater reliability (IRR), interconsensus reliability (ICR), and evaluator burden of the Risk of Bias (RoB) in Nonrandomized Studies (NRS) of Interventions (ROBINS-I) tool and the RoB Instrument for NRS of Exposures (ROB-NRSE) tool.
A six-center cross-sectional study with seven reviewers (two reviewer pairs) assessing the RoB using ROBINS-I (n = 44 NRS) or ROB-NRSE (n = 44 NRS). We used Gwet's AC statistic to calculate the IRR and ICR. To measure the evaluator burden, we assessed the total time taken to apply the tool and reach a consensus.
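For readers unfamiliar with Gwet's agreement coefficient, the sketch below shows how the two-rater, nominal-category version (AC1) is computed: observed agreement is corrected by a chance-agreement term built from the average marginal proportions. This is an illustrative implementation under general assumptions, not the authors' analysis code, which may have used a different variant or software package.

```python
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 chance-corrected agreement for two raters over
    nominal categories (illustrative sketch, not the study's code)."""
    assert len(ratings_a) == len(ratings_b) and ratings_a, "paired ratings required"
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    # Observed agreement: fraction of subjects rated identically
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Average marginal proportion pi_k for each category across both raters
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = {k: counts[k] / (2 * n) for k in categories}
    # Chance-agreement probability under Gwet's model
    p_e = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical example: two raters' overall RoB judgments on six studies
rater1 = ["low", "moderate", "serious", "low", "moderate", "low"]
rater2 = ["low", "serious", "serious", "low", "moderate", "moderate"]
print(round(gwet_ac1(rater1, rater2), 3))
```

Unlike Cohen's kappa, AC1 remains stable when category prevalences are highly skewed, which is one reason it is often preferred for RoB agreement studies.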
For ROBINS-I, both IRR and ICR for individual domains ranged from poor to substantial agreement. IRR and ICR on overall RoB were poor. The evaluator burden was 48.45 min (95% CI 45.61 to 51.29). For ROB-NRSE, the IRR and ICR for the majority of domains were poor, while the rest ranged from fair to perfect agreement. IRR and ICR on overall RoB were slight and poor, respectively. The evaluator burden was 36.98 min (95% CI 34.80 to 39.16).
We found both tools to have low reliability, although agreement for ROBINS-I was slightly higher. Measures to increase agreement between raters (e.g., detailed training, supportive guidance material) may improve reliability and decrease evaluator burden.