Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews.

Affiliations

Department of Clinical Laboratory, National Center Hospital, National Center of Neurology and Psychiatry, Kodaira, Japan.

Department of Sleep-Wake Disorders, National Institute of Mental Health, National Center of Neurology and Psychiatry, Kodaira, Japan.

Publication information

J Med Internet Res. 2024 Aug 16;26:e52758. doi: 10.2196/52758.

DOI: 10.2196/52758
PMID: 39151163
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11364944/
Abstract

BACKGROUND

The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers.

OBJECTIVE

We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract screening process for systematic reviews. Our goal was to develop a screening method that maximizes sensitivity for identifying relevant records.

METHODS

We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. The 3-layer screening was conducted using prompts tailored to each study. During this process, information extraction according to each study's inclusion criteria and optimization for screening were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and those meeting the inclusion criteria at all layers were subsequently judged as included.
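The all-layers inclusion rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Record` type, the `screen_record` function, and the keyword-based judges are hypothetical stand-ins for the per-layer GPT prompts, which the abstract does not reproduce.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Record:
    title: str
    abstract: str


# A judge takes a record and returns True if it meets one layer's criteria.
Judge = Callable[[Record], bool]

# Layer names follow the three layers named in the abstract.
LAYERS = ("research_design", "target_patients", "interventions_and_controls")


def screen_record(record: Record, judges: Dict[str, Judge]) -> bool:
    """Evaluate the record at every layer; include it only if all layers pass."""
    results = {layer: judges[layer](record) for layer in LAYERS}
    return all(results.values())


def keyword_judge(*keywords: str) -> Judge:
    """Hypothetical stand-in for an LLM call: passes if any keyword appears."""
    def judge(record: Record) -> bool:
        text = f"{record.title} {record.abstract}".lower()
        return any(keyword in text for keyword in keywords)
    return judge


# Toy criteria chosen only for illustration.
toy_judges = {
    "research_design": keyword_judge("randomized", "randomised"),
    "target_patients": keyword_judge("bipolar"),
    "interventions_and_controls": keyword_judge("placebo"),
}

included = Record(
    "A randomized trial of lithium in bipolar disorder",
    "Lithium was compared with placebo.",
)
excluded = Record(
    "A case report of bipolar disorder",
    "One patient received lithium.",
)

print(screen_record(included, toy_judges))
print(screen_record(excluded, toy_judges))
```

In a real pipeline each judge would wrap a model call with a layer-specific prompt; keeping the judges pluggable lets the inclusion rule be tested without any network access.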

RESULTS

On each layer, both GPT-3.5 and GPT-4 processed about 110 records per minute, and the total time required to screen the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively. Both GPT-3.5 and GPT-4 judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.958/0.116 and 0.875/0.855, respectively. The sensitivities for the relevant records align with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the second study. Both GPT-3.5 and GPT-4 judged all 9 records used for the meta-analysis as included. After accounting for records justifiably excluded by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second study. Further investigation indicated that the cases incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, whereas the cases incorrectly excluded by GPT-4 were due to misinterpretations of the inclusion criteria.
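For readers less familiar with the two metrics reported above, the standard definitions can be sketched in a few lines. The function names and the counts below are illustrative assumptions; the counts are toy values and are not taken from the study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly relevant records that the screener included."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of truly irrelevant records that the screener excluded."""
    return true_neg / (true_neg + false_pos)


# Toy confusion-matrix counts, chosen only for illustration (not study data).
tp, fn, tn, fp = 9, 1, 80, 20

print(round(sensitivity(tp, fn), 3))
print(round(specificity(tn, fp), 3))
```

A screener tuned for high sensitivity rarely drops a relevant record, at the cost of passing more irrelevant ones to human reviewers, which is the trade-off visible in the GPT-3.5 and GPT-4 figures.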

CONCLUSIONS

Our 3-layer screening method with GPT-4 demonstrated an acceptable level of sensitivity and specificity that supports its practical application in systematic review screening. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b905/11364944/dff2f9ba626f/jmir_v26i1e52758_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b905/11364944/b35da5f36209/jmir_v26i1e52758_fig2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b905/11364944/da26bf3a4841/jmir_v26i1e52758_fig3.jpg