Trinity Centre for Global Health, Trinity College Dublin, Dublin, Ireland.
School of Psychology, Trinity College Dublin, Dublin, Ireland.
Res Synth Methods. 2024 Jul;15(4):616-626. doi: 10.1002/jrsm.1715. Epub 2024 Mar 14.
Systematic reviews are vital for guiding practice, research and policy, although they are often slow and labour-intensive. Large language models (LLMs) could speed up and automate systematic reviews, but their performance in such tasks has yet to be comprehensively evaluated against humans, and no study has tested Generative Pre-Trained Transformer (GPT)-4, the biggest LLM so far. This pre-registered study uses a "human-out-of-the-loop" approach to evaluate GPT-4's capability in title/abstract screening, full-text review and data extraction across various literature types and languages. Although GPT-4 had accuracy on par with human performance in some tasks, results were skewed by chance agreement and dataset imbalance. Adjusting for these caused performance scores to drop across all stages: for data extraction, performance was moderate, and for screening, it ranged from none in highly balanced literature datasets (1:1) to moderate in datasets where the ratio of included to excluded studies was imbalanced (1:3). When screening full-text literature using highly reliable prompts, GPT-4's performance was more robust, reaching "human-like" levels. Although our findings indicate that, currently, substantial caution should be exercised if LLMs are being used to conduct systematic reviews, they also offer preliminary evidence that, for certain review tasks delivered under specific conditions, LLMs can rival human performance.
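To illustrate the methodological point about chance agreement and dataset imbalance, the sketch below (not taken from the paper; all counts and screener behaviours are hypothetical) shows how raw accuracy can look high on an imbalanced 1:3 include:exclude screening set while a chance-corrected statistic such as Cohen's kappa tells a different story.

```python
# Illustrative sketch only: why raw accuracy overstates screening
# performance on an imbalanced title/abstract dataset, and how a
# chance-corrected measure (Cohen's kappa) adjusts for it.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical gold standard with a 1:3 include:exclude ratio
# (25 included abstracts, 75 excluded abstracts).
human = [1] * 25 + [0] * 75

# Hypothetical screener A: simply excludes everything.
exclude_all = [0] * 100

# Hypothetical screener B: catches 15/25 includes and 70/75 excludes.
imperfect = [1] * 15 + [0] * 10 + [0] * 70 + [1] * 5

for name, preds in [("exclude-all", exclude_all), ("imperfect", imperfect)]:
    acc = accuracy_score(human, preds)
    kappa = cohen_kappa_score(human, preds)
    print(f"{name}: accuracy={acc:.2f}, kappa={kappa:.2f}")

# exclude-all: accuracy=0.75, kappa=0.00  -> high accuracy, no real agreement
# imperfect:   accuracy=0.85, kappa=0.57  -> accuracy inflated by the easy
#              majority class; kappa reflects only "moderate" agreement
```

In this toy setting, a screener that rejects every record already scores 75% accuracy purely because exclusions dominate, whereas its kappa is zero; this is the kind of inflation the authors adjust for when they report lower, chance-corrected performance.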