

Inter-reviewer reliability of human literature reviewing and implications for the introduction of machine-assisted systematic reviews: a mixed-methods review.

Author information

Pitts, Zeist, The Netherlands.

Health-Ecore, Zeist, The Netherlands.

Publication information

BMJ Open. 2024 Mar 19;14(3):e076912. doi: 10.1136/bmjopen-2023-076912.

Abstract

OBJECTIVES

Our main objective is to assess the inter-reviewer reliability (IRR) reported in published systematic literature reviews (SLRs). Our secondary objective is to determine the IRR that authors of SLRs expect for both human and machine-assisted reviews.

METHODS

We performed a review of SLRs of randomised controlled trials using the PubMed and Embase databases. Data were extracted on IRR, by means of Cohen's kappa score, for abstract/title screening, full-text screening and data extraction, together with review team size, items screened and the quality of the review as assessed with A MeaSurement Tool to Assess systematic Reviews 2 (AMSTAR 2). In addition, we surveyed authors of SLRs on their expectations of machine learning automation and of human-performed IRR in SLRs.
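As background to the extracted measure: Cohen's kappa corrects the observed agreement between two reviewers for the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement derived from each reviewer's marginal label frequencies. The sketch below is illustrative only; the include/exclude decisions are hypothetical and are not data from the included SLRs.

```python
# Minimal sketch: Cohen's kappa for two reviewers' include/exclude decisions.
# The decision lists below are hypothetical and purely illustrative.
from collections import Counter

def cohens_kappa(reviewer_a, reviewer_b):
    """Return Cohen's kappa for two equally long lists of categorical labels."""
    assert len(reviewer_a) == len(reviewer_b)
    n = len(reviewer_a)
    # Observed agreement: proportion of items on which both reviewers agree.
    p_o = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / n
    # Chance agreement: sum over labels of the product of marginal proportions.
    counts_a, counts_b = Counter(reviewer_a), Counter(reviewer_b)
    labels = set(reviewer_a) | set(reviewer_b)
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n) for label in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical abstract-screening decisions (1 = include, 0 = exclude).
reviewer_1 = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
reviewer_2 = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(f"Cohen's kappa: {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```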

RESULTS

After removal of duplicates, 836 articles were screened on abstract and 413 on full text. In total, 45 eligible articles were included. The average Cohen's kappa score reported was 0.82 (SD=0.11, n=12) for abstract screening, 0.77 (SD=0.18, n=14) for full-text screening, 0.86 (SD=0.07, n=15) for the whole screening process and 0.88 (SD=0.08, n=16) for data extraction. No association was observed between the reported IRR and review team size, items screened or quality of the SLR. The survey (n=37) showed overlapping expected Cohen's kappa values, ranging from approximately 0.6 to 0.9, for both human and machine learning-assisted SLRs. No trend was observed between reviewer experience and expected IRR. Authors expect a higher-than-average IRR for machine learning-assisted SLRs compared with human-performed SLRs, for both screening and data extraction.

CONCLUSION

Currently, it is not common to report IRR in the scientific literature for either human or machine learning-assisted SLRs. This mixed-methods review provides initial guidance on a human IRR benchmark, which could be used as a minimum threshold for IRR in machine learning-assisted SLRs.

PROSPERO REGISTRATION NUMBER

CRD42023386706.
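The conclusion proposes the human IRR benchmark as a minimum threshold for machine learning-assisted SLRs. A minimal sketch of such a check, using the average kappa scores reported in the Results as benchmarks, might look as follows; the machine-assisted scores passed in are hypothetical and for illustration only.

```python
# Minimal sketch: comparing machine-assisted kappa scores against the human
# benchmarks reported in this review (0.82 abstract screening, 0.77 full-text
# screening, 0.88 data extraction). The ML-assisted scores are hypothetical.
HUMAN_BENCHMARKS = {
    "abstract_screening": 0.82,
    "full_text_screening": 0.77,
    "data_extraction": 0.88,
}

def meets_human_benchmark(stage: str, ml_kappa: float) -> bool:
    """Return True if the machine-assisted kappa reaches the human average."""
    return ml_kappa >= HUMAN_BENCHMARKS[stage]

# Hypothetical machine-assisted kappa scores, for illustration only.
for stage, kappa in [("abstract_screening", 0.85), ("full_text_screening", 0.70)]:
    verdict = "meets" if meets_human_benchmark(stage, kappa) else "falls below"
    print(f"{stage}: kappa={kappa:.2f} {verdict} the human benchmark")
```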


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1f49/10952858/428c769cd4c4/bmjopen-2023-076912f01.jpg
