Chen Haichao, Jiang Zehua, Liu Xinyu, Xue Can Can, Yew Samantha Min Er, Sheng Bin, Zheng Ying-Feng, Wang Xiaofei, Wu You, Sivaprasad Sobha, Wong Tien Yin, Chaudhary Varun, Tham Yih Chung
Tsinghua Medicine, Tsinghua University, Beijing, China.
Institute of Medical Technology, Peking University Health Science Center, Beijing, China.
Br J Ophthalmol. 2025 Apr 21. doi: 10.1136/bjo-2024-326254.
BACKGROUND/AIMS: Large language models (LLMs) have substantial potential to enhance the efficiency of academic research. The accuracy and performance of LLMs in systematic reviews, a core part of evidence building, have yet to be studied in detail.
We introduced two LLM-based approaches to systematic review: an LLM-enabled fully automated approach (LLM-FA) utilising three different GPT-4 plugins (Consensus GPT, Scholar GPT and GPT-4's web browsing mode) and an LLM-facilitated semi-automated approach (LLM-SA) using GPT-4's Application Programming Interface (API). We benchmarked these approaches against three published systematic reviews that reported the prevalence of diabetic retinopathy across different populations (general population, pregnant women and children).
The three published reviews consisted of 98 papers in total. Across these three reviews, in the LLM-FA approach, Consensus GPT correctly identified 32.7% (32 out of 98) of papers, while Scholar GPT and GPT-4's web browsing mode identified only 19.4% (19 out of 98) and 6.1% (6 out of 98), respectively. In contrast, the LLM-SA approach not only successfully included 82.7% (81 out of 98) of these papers but also correctly excluded 92.2% of 4497 irrelevant papers.
Our findings suggest LLMs are not yet capable of autonomously identifying and selecting relevant papers in systematic reviews. However, they hold promise as an assistive tool to improve the efficiency of the paper selection process in systematic reviews.
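To make the LLM-SA idea concrete, below is a minimal sketch of semi-automated title/abstract screening via an LLM API. This is not the authors' code: the prompt wording, the `call_llm` callable, and the default-to-exclude tie-breaking rule are all assumptions for illustration.

```python
def build_screening_prompt(criteria: str, title: str, abstract: str) -> str:
    """Compose a one-word include/exclude screening prompt for one candidate paper."""
    return (
        "You are screening papers for a systematic review.\n"
        f"Inclusion criteria: {criteria}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )

def parse_decision(reply: str) -> bool:
    """Map the model's free-text reply to an include/exclude decision.
    Ambiguous replies default to exclusion so a human reviewer
    re-checks borderline cases (an assumed design choice)."""
    return reply.strip().upper().startswith("INCLUDE")

def screen_papers(papers, criteria, call_llm):
    """Return the subset of papers the model flags for inclusion.
    `call_llm` is any callable that sends a prompt string to an LLM API
    (e.g. a chat-completion wrapper) and returns its text reply."""
    included = []
    for paper in papers:
        prompt = build_screening_prompt(criteria, paper["title"], paper["abstract"])
        if parse_decision(call_llm(prompt)):
            included.append(paper)
    return included
```

In practice, `call_llm` would wrap a real API client; for testing, a stub that answers INCLUDE or EXCLUDE deterministically is enough to exercise the screening loop. Decisions flagged INCLUDE would still be verified by human reviewers, consistent with the semi-automated framing above.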