Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA.
Department of Mathematics and Statistics, University of North Carolina at Greensboro, Greensboro, NC, 27402, USA.
Syst Rev. 2024 Aug 21;13(1):219. doi: 10.1186/s13643-024-02609-x.
This study aimed to evaluate the performance of large language models (LLMs) in the task of abstract screening in systematic review and meta-analysis studies, exploring their effectiveness, efficiency, and potential integration into existing human expert-based workflows.
We developed automation scripts in Python to interact with the APIs of several LLM tools, including ChatGPT v4.0, ChatGPT v3.5, Google PaLM 2, and Meta Llama 2, as well as the latest tools, including ChatGPT v4.0 turbo, ChatGPT v3.5 turbo, Google Gemini 1.0 pro, Meta Llama 3, and Claude 3. This study focused on three databases of abstracts and used them as benchmarks to evaluate the performance of these LLM tools in terms of sensitivity, specificity, and overall accuracy. The results of the LLM tools were compared to human-curated inclusion decisions, the gold standard for systematic review and meta-analysis studies.
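To make the three evaluation metrics concrete, the following is a minimal sketch of how LLM screening decisions could be scored against human-curated decisions. It is not the authors' script: the function name `screening_metrics` and the boolean encoding of decisions (`True` for include, `False` for exclude) are assumptions made for illustration, and the step that obtains decisions from an LLM API is deliberately left out.

```python
import math
from typing import Dict, Sequence


def screening_metrics(
    llm_include: Sequence[bool], human_include: Sequence[bool]
) -> Dict[str, float]:
    """Score LLM inclusion decisions against human reference decisions.

    Hypothetical helper: True means "include the abstract", False means "exclude".
    """
    if len(llm_include) != len(human_include):
        raise ValueError("Both decision lists must have the same length.")

    pairs = list(zip(llm_include, human_include))
    tp = sum(1 for llm, human in pairs if llm and human)          # correctly included
    tn = sum(1 for llm, human in pairs if not llm and not human)  # correctly excluded
    fp = sum(1 for llm, human in pairs if llm and not human)      # wrongly included
    fn = sum(1 for llm, human in pairs if not llm and human)      # wrongly excluded

    # Return NaN when a denominator is zero instead of raising an error.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    accuracy = (tp + tn) / len(pairs) if pairs else math.nan

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
    }
```

Sensitivity here is the share of human-included abstracts that the LLM also included, specificity is the share of human-excluded abstracts that the LLM also excluded, and accuracy is the share of all abstracts on which the two agree.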
Different LLM tools had varying abilities in abstract screening. ChatGPT v4.0 demonstrated remarkable performance, with balanced sensitivity and specificity, and overall accuracy consistently reaching or exceeding 90%, indicating a high potential for LLMs in abstract screening tasks. The study found that LLMs could provide reliable results with minimal human effort and thus serve as a cost-effective and efficient alternative to traditional abstract screening methods.
While LLM tools are not yet ready to completely replace human experts in abstract screening, they show great promise in revolutionizing the process. They can serve as autonomous AI reviewers, contribute to collaborative workflows with human experts, and integrate with hybrid approaches to develop custom tools for increased efficiency. As technology continues to advance, LLMs are poised to play an increasingly important role in abstract screening, reshaping the workflow of systematic review and meta-analysis studies.