Kern Center for the Science of Healthcare Delivery, Mayo Clinic, Rochester, Minnesota, USA.
Public Health, Infectious Diseases and Occupational Medicine, Mayo Clinic, Rochester, Minnesota, USA.
BMJ Evid Based Med. 2024 Nov 22;29(6):394-398. doi: 10.1136/bmjebm-2023-112597.
Large language models (LLMs) may facilitate and expedite systematic reviews, although the approach to integrating LLMs into the review process remains unclear. This study evaluates GPT-4's agreement with human reviewers in assessing risk of bias using the Risk Of Bias In Non-randomised Studies of Interventions (ROBINS-I) tool and proposes a framework for integrating LLMs into systematic reviews. In the case study, raw per cent agreement was highest for the ROBINS-I domain of 'Classification of Intervention'. The Kendall agreement coefficient was highest for the domains of 'Participant Selection', 'Missing Data' and 'Measurement of Outcomes', suggesting moderate agreement in these domains. Raw agreement on the overall risk of bias across domains was 61% (Kendall coefficient=0.35). The proposed framework for integrating LLMs into systematic reviews consists of four domains: rationale for LLM use, protocol (task definition, model selection, prompt engineering, data entry methods, human role and success metrics), execution (iterative revisions to the protocol) and reporting. We identify five basic task types relevant to systematic reviews: selection, extraction, judgement, analysis and narration. Given the level of agreement with a human reviewer observed in the case study, pairing artificial intelligence with an independent human reviewer remains necessary.
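The two agreement measures reported above can be sketched for paired ordinal ratings. The following is a minimal illustration, not the study's analysis code: the ratings are hypothetical (1=low, 2=moderate, 3=serious risk of bias), and Kendall's tau-b is implemented directly so the tie handling is explicit.

```python
from itertools import combinations

# Hypothetical paired ROBINS-I judgements for eight studies
# (1=low, 2=moderate, 3=serious risk of bias); illustrative only.
human = [1, 2, 2, 3, 1, 2, 3, 1]
gpt4  = [1, 2, 3, 3, 1, 1, 3, 2]

def raw_agreement(a, b):
    """Proportion of items on which the two raters give the same rating."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def kendall_tau_b(a, b):
    """Kendall's tau-b for paired ordinal ratings (adjusts for ties)."""
    c = d = ta = tb = 0  # concordant, discordant, tied-in-a-only, tied-in-b-only
    for (x1, y1), (x2, y2) in combinations(zip(a, b), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both raters: excluded from the denominator
        elif dx == 0:
            ta += 1
        elif dy == 0:
            tb += 1
        elif dx * dy > 0:
            c += 1
        else:
            d += 1
    return (c - d) / (((c + d + ta) * (c + d + tb)) ** 0.5)

print(f"raw agreement: {raw_agreement(human, gpt4):.2f}")
print(f"Kendall tau-b: {kendall_tau_b(human, gpt4):.2f}")
```

With these made-up ratings, raw agreement is 5/8 (0.62), in the same range as the 61% overall-domain agreement the study reports; the tau-b value illustrates why percent agreement and a rank correlation can give different impressions of the same rating pairs.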