Chen Haichao, Jiang Zehua, Liu Xinyu, Xue Can Can, Yew Samantha Min Er, Sheng Bin, Zheng Ying-Feng, Wang Xiaofei, Wu You, Sivaprasad Sobha, Wong Tien Yin, Chaudhary Varun, Tham Yih Chung
Tsinghua Medicine, Tsinghua University, Beijing, China.
Institute of Medical Technology, Peking University Health Science Center, Beijing, China.
Br J Ophthalmol. 2025 Apr 21. doi: 10.1136/bjo-2024-326254.
BACKGROUND/AIMS: Large language models (LLMs) have substantial potential to enhance the efficiency of academic research. The accuracy and performance of LLMs in systematic reviews, a core part of evidence building, have yet to be studied in detail.
We introduced two LLM-based approaches to systematic review: an LLM-enabled fully automated approach (LLM-FA) utilising three different GPT-4 plugins (Consensus GPT, Scholar GPT and GPT-4's web browsing mode) and an LLM-facilitated semi-automated approach (LLM-SA) using GPT-4's Application Programming Interface (API). We benchmarked these approaches against three published systematic reviews that reported the prevalence of diabetic retinopathy across different populations (general population, pregnant women and children).
The three published reviews consisted of 98 papers in total. Across these three reviews, in the LLM-FA approach, Consensus GPT correctly identified 32.7% (32 out of 98) of papers, while Scholar GPT and GPT-4's web browsing mode identified only 19.4% (19 out of 98) and 6.1% (6 out of 98), respectively. In contrast, the LLM-SA approach not only successfully included 82.7% (81 out of 98) of these papers but also correctly excluded 92.2% of 4497 irrelevant papers.
Our findings suggest LLMs are not yet capable of autonomously identifying and selecting relevant papers in systematic reviews. However, they hold promise as an assistive tool to improve the efficiency of the paper selection process in systematic reviews.
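To make the LLM-SA idea concrete, below is a minimal sketch of semi-automated title/abstract screening via an LLM API. This is not the authors' code: the prompt wording, the `call_llm` callable, and the default-to-exclude tie-breaking rule are all assumptions for illustration.

```python
def build_screening_prompt(criteria: str, title: str, abstract: str) -> str:
    """Compose a one-word include/exclude screening prompt for one candidate paper."""
    return (
        "You are screening papers for a systematic review.\n"
        f"Inclusion criteria: {criteria}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )

def parse_decision(reply: str) -> bool:
    """Map the model's free-text reply to an include/exclude decision.
    Ambiguous replies default to exclusion so a human reviewer
    re-checks borderline cases (an assumed design choice)."""
    return reply.strip().upper().startswith("INCLUDE")

def screen_papers(papers, criteria, call_llm):
    """Return the subset of papers the model flags for inclusion.
    `call_llm` is any callable that sends a prompt string to an LLM API
    (e.g. a chat-completion wrapper) and returns its text reply."""
    included = []
    for paper in papers:
        prompt = build_screening_prompt(criteria, paper["title"], paper["abstract"])
        if parse_decision(call_llm(prompt)):
            included.append(paper)
    return included
```

In practice, `call_llm` would wrap a real API client; for testing, a stub that answers INCLUDE or EXCLUDE deterministically is enough to exercise the screening loop. Decisions flagged INCLUDE would still be verified by human reviewers, consistent with the semi-automated framing above.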