Centre for Computational Science and Mathematical Modelling, Coventry University, Coventry CV1 2TT, United Kingdom.
Information School, The University of Sheffield, Sheffield S10 2AH, United Kingdom.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1939-1952. doi: 10.1093/jamia/ocae166.
This paper aims to address the challenges of abstract screening in systematic reviews (SRs) by leveraging the zero-shot capabilities of large language models (LLMs).
We employ an LLM to prioritize candidate studies by aligning abstracts with the selection criteria outlined in an SR protocol. Abstract screening is transformed into a novel question-answering (QA) framework, treating each selection criterion as a question to be answered by the LLM. The framework involves breaking down the selection criteria into multiple questions, prompting the LLM to answer each question, scoring and re-ranking each answer, and combining the responses to make nuanced inclusion or exclusion decisions.
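A minimal sketch of such a QA-style screening loop is given below. It assumes the OpenAI chat completions client and GPT-3.5; the prompt wording, the yes/no/unclear scoring rule, and the averaging step are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of the per-criterion QA screening loop; prompts,
# scoring scale, and aggregation here are assumptions for illustration.
from openai import OpenAI  # assumes the official openai Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_criterion(abstract: str, question: str, model: str = "gpt-3.5-turbo") -> str:
    """Pose one selection criterion, phrased as a question, about one abstract."""
    prompt = (
        "You are screening studies for a systematic review.\n"
        f"Abstract:\n{abstract}\n\n"
        f"Question: {question}\n"
        "Answer 'yes', 'no', or 'unclear', then briefly justify."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content


def score_answer(answer: str) -> float:
    """Toy scoring rule: map the leading yes/no/unclear token to a number."""
    head = answer.strip().lower()
    if head.startswith("yes"):
        return 1.0
    if head.startswith("unclear"):
        return 0.5
    return 0.0


def screen(abstract: str, criterion_questions: list[str]) -> float:
    """Combine per-criterion scores into one relevance score for ranking."""
    scores = [score_answer(ask_criterion(abstract, q)) for q in criterion_questions]
    return sum(scores) / len(scores)
```

Candidate studies can then be ranked by this combined score, so that abstracts most likely to satisfy the selection criteria are screened first.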
Large-scale validation was performed on the benchmark of CLEF eHealth 2019 Task 2: Technology-Assisted Reviews in Empirical Medicine. Focusing on GPT-3.5 as a case study, the proposed QA framework consistently outperformed traditional information retrieval approaches and bespoke BERT-family models fine-tuned for prioritizing candidate studies (ie, from BERT to PubMedBERT) across 31 datasets spanning 4 categories of SRs, underscoring the high potential of LLMs in facilitating abstract screening. The experiments also showed the viability of using the selection criteria as a query for reference prioritization and of instantiating the framework with different LLMs.
The investigation confirmed the indispensable value of leveraging selection criteria to improve the performance of automated abstract screening. LLMs demonstrated proficiency in prioritizing candidate studies for abstract screening under the proposed QA framework. Significant performance improvements were obtained by re-ranking answers using the semantic alignment between abstracts and selection criteria, which further highlights the pertinence of utilizing selection criteria to enhance abstract screening.
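The re-ranking step could, for example, blend the LLM-derived score with an embedding-based alignment score between the abstract and the selection criteria. The sketch below is one such assumption: it uses sentence-transformers and cosine similarity, and the encoder choice and blending weight are illustrative, not the paper's reported configuration.

```python
# Illustrative re-ranking by semantic alignment; the embedding model and the
# weighting scheme below are assumptions, not the authors' exact method.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed lightweight encoder


def alignment_score(abstract: str, criteria: list[str]) -> float:
    """Mean cosine similarity between the abstract and each selection criterion."""
    abs_emb = encoder.encode(abstract, convert_to_tensor=True)
    crit_embs = encoder.encode(criteria, convert_to_tensor=True)
    sims = util.cos_sim(abs_emb, crit_embs)  # shape: (1, num_criteria)
    return float(sims.mean())


def rerank(
    candidates: list[tuple[str, float]],  # (abstract, LLM-derived score) pairs
    criteria: list[str],
    w: float = 0.5,  # assumed blending weight between LLM and alignment scores
) -> list[tuple[str, float]]:
    """Blend the LLM score with the alignment score and sort descending."""
    rescored = [
        (abstract, w * llm_score + (1 - w) * alignment_score(abstract, criteria))
        for abstract, llm_score in candidates
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```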