Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain.

Affiliations

Department of Radiation Oncology, Cantonal Hospital of St. Gallen, St. Gallen, Switzerland.

Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland.

Publication information

Syst Rev. 2024 Jun 15;13(1):158. doi: 10.1186/s13643-024-02575-4.

DOI: 10.1186/s13643-024-02575-4
PMID: 38879534
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11180407/

Abstract

BACKGROUND

Systematically screening published literature to determine the relevant publications to synthesize in a review is a time-consuming and difficult task. Large language models (LLMs) are an emerging technology with promising capabilities for the automation of language-related tasks that may be useful for such a purpose.

METHODS

LLMs were used as part of an automated system to evaluate the relevance of publications to a given topic, based on defined criteria and on the title and abstract of each publication. A Python script was created to generate structured prompts consisting of text strings for the instruction, title, abstract, and relevant criteria to be provided to an LLM. The LLM rated the relevance of each publication on a Likert scale (low relevance to high relevance). By specifying a threshold on this rating, different classifiers for the inclusion/exclusion of publications could then be defined. The approach was applied with four different openly available LLMs to ten published data sets of biomedical literature reviews and to a newly human-created data set for a hypothetical new systematic literature review.
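
The prompt-and-threshold procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' script: the instruction wording, the function names `build_prompt`, `parse_score`, and `include`, the placeholder criteria, and the rule that an unparseable reply is kept for manual checking are all hypothetical. The call to an actual LLM is left out.

```python
import re
from typing import Optional

# Hypothetical instruction text; the wording used in the study is not given in the abstract.
INSTRUCTION = (
    "Rate how relevant the publication below is to the review topic, "
    "given the criteria. Answer with a single integer from 1 (low relevance) "
    "to {scale_max} (high relevance)."
)


def build_prompt(title: str, abstract: str, criteria: list[str], scale_max: int = 5) -> str:
    """Assemble instruction, title, abstract, and criteria into one prompt string."""
    criteria_text = "\n".join(f"- {c}" for c in criteria)
    return (
        f"{INSTRUCTION.format(scale_max=scale_max)}\n\n"
        f"Title: {title}\n\n"
        f"Abstract: {abstract}\n\n"
        f"Criteria:\n{criteria_text}\n\n"
        "Rating:"
    )


def parse_score(reply: str, scale_max: int = 5) -> Optional[int]:
    """Return the first integer in the reply that lies on the scale, else None."""
    for token in re.findall(r"\d+", reply):
        value = int(token)
        if 1 <= value <= scale_max:
            return value
    return None


def include(score: Optional[int], threshold: int) -> bool:
    """Threshold classifier: include a publication if its score reaches the threshold.

    Assumption: unparseable replies (None) are included so a human can check them.
    """
    return score is None or score >= threshold


# Placeholder inputs, for illustration only.
prompt = build_prompt(
    title="Example title",
    abstract="Example abstract text.",
    criteria=["Criterion A", "Criterion B"],
)
```

Each choice of `threshold` gives a different inclusion/exclusion classifier over the same LLM ratings, and `scale_max` switches between a 1-5 and a 1-10 scale.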

RESULTS

The performance of the classifiers varied depending on the LLM used and on the data set analyzed. On the ten published data sets, the classifiers yielded a sensitivity/specificity of 94.48%/31.78% for the FlanT5 model, 97.58%/19.12% for the OpenHermes-NeuralChat model, 81.93%/75.19% for the Mixtral model, and 97.58%/38.34% for the Platypus 2 model. On the newly created data set, the same classifiers yielded 100% sensitivity at specificities of 12.58%, 4.54%, 62.47%, and 24.74%, respectively. Changing the standard settings of the approach (minor adaptation of the instruction prompt and/or changing the range of the Likert scale from 1-5 to 1-10) had a considerable impact on performance.
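
For readers less familiar with the metrics, the sketch below shows how sensitivity and specificity can be computed for a threshold classifier over Likert ratings. It is an illustration only: the function name `sensitivity_specificity` is hypothetical, and the scores and labels are invented toy values, not data from the study.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Compute (sensitivity, specificity) when a score >= threshold means 'include'.

    scores: LLM relevance ratings; labels: True if the publication is truly relevant.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Invented toy data: three relevant and five irrelevant publications.
toy_scores = [5, 4, 2, 1, 3, 2, 5, 1]
toy_labels = [True, True, True, False, False, False, False, False]

for threshold in (2, 3):
    sens, spec = sensitivity_specificity(toy_scores, toy_labels, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, lowering the threshold from 3 to 2 raises sensitivity and lowers specificity, the same kind of trade-off that separates the high-sensitivity, low-specificity classifiers from the more balanced one in the results above.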

CONCLUSIONS

LLMs can be used to evaluate the relevance of scientific publications to a certain review topic and classifiers based on such an approach show some promising results. To date, little is known about how well such systems would perform if used prospectively when conducting systematic literature reviews and what further implications this might have. However, it is likely that in the future researchers will increasingly use LLMs for evaluating and classifying scientific publications.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/687d/11180407/19bb5d1ffefd/13643_2024_2575_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/687d/11180407/83e6316d820d/13643_2024_2575_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/687d/11180407/4780bd31048f/13643_2024_2575_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/687d/11180407/d56449d71562/13643_2024_2575_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/687d/11180407/f4824fe8d8c0/13643_2024_2575_Fig5_HTML.jpg
