Closing the gap between open-source and commercial large language models for medical evidence summarization.

Authors

Zhang Gongbo, Jin Qiao, Zhou Yiliang, Wang Song, Idnay Betina R, Luo Yiming, Park Elizabeth, Nestor Jordan G, Spotnitz Matthew E, Soroush Ali, Campion Thomas, Lu Zhiyong, Weng Chunhua, Peng Yifan

Affiliations

Department of Biomedical Informatics, Columbia University, New York, NY, USA.

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Publication

ArXiv. 2024 Jul 25:arXiv:2408.00588v1.

PMID: 39371088
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11451644/
Abstract

Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance in summarizing medical evidence. Utilizing a benchmark dataset, MedReview, consisting of 8,161 pairs of systematic reviews and summaries, we fine-tuned three broadly-used, open-sourced LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the fine-tuned LLMs obtained an increase of 9.89 in ROUGE-L (95% confidence interval: 8.94-10.81), 13.21 in METEOR score (95% confidence interval: 12.05-14.37), and 15.82 in CHRF score (95% confidence interval: 13.89-16.44). The performance of fine-tuned LongT5 is close to GPT-3.5 with zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. The above trends of improvement were also manifested in both human and GPT4-simulated evaluations. Our results can be applied to guide model selection for tasks demanding particular domain knowledge, such as medical evidence summarization.
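The reported gains are expressed in ROUGE-L, METEOR, and CHRF points. As a rough illustration of what the headline metric measures, below is a minimal, dependency-free sketch of ROUGE-L F1, which scores a generated summary against a reference by their longest common subsequence of tokens. This is not the paper's evaluation code; published evaluations typically use packages such as `rouge-score`, which also apply stemming and other normalization.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 on whitespace tokens: harmonic mean of LCS-based precision and recall."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A "9.89 increase in ROUGE-L" in the abstract refers to this score scaled to 0-100, so fine-tuning moved the fine-tuned models roughly ten points closer to the reference summaries on this overlap measure.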


Figures (PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aeeb/11451644/605974f87ccd/nihpp-2408.00588v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aeeb/11451644/31c1c2b13ca8/nihpp-2408.00588v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aeeb/11451644/74c236235c20/nihpp-2408.00588v1-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aeeb/11451644/b2807c38bcb2/nihpp-2408.00588v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aeeb/11451644/20f996319322/nihpp-2408.00588v1-f0005.jpg

Similar articles

1. Closing the gap between open-source and commercial large language models for medical evidence summarization.
   ArXiv. 2024 Jul 25:arXiv:2408.00588v1.
2. Closing the gap between open source and commercial large language models for medical evidence summarization.
   NPJ Digit Med. 2024 Sep 9;7(1):239. doi: 10.1038/s41746-024-01239-w.
3. Distilling large language models for matching patients to clinical trials.
   J Am Med Inform Assoc. 2024 Sep 1;31(9):1953-1963. doi: 10.1093/jamia/ocae073.
4. Me-LLaMA: Foundation Large Language Models for Medical Applications.
   Res Sq. 2024 May 22:rs.3.rs-4240043. doi: 10.21203/rs.3.rs-4240043/v1.
5. Automated Extraction of Patient-Centered Outcomes After Breast Cancer Treatment: An Open-Source Large Language Model-Based Toolkit.
   JCO Clin Cancer Inform. 2024 Aug;8:e2300258. doi: 10.1200/CCI.23.00258.
6. Benchmarking Large Language Models in Evidence-Based Medicine.
   IEEE J Biomed Health Inform. 2024 Oct 21;PP. doi: 10.1109/JBHI.2024.3483816.
7. Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: Benchmark Study.
   JMIR Ment Health. 2024 Jul 23;11:e57306. doi: 10.2196/57306.
8. Evaluating large language models for health-related text classification tasks with public social media data.
   J Am Med Inform Assoc. 2024 Oct 1;31(10):2181-2189. doi: 10.1093/jamia/ocae210.
9. A comprehensive evaluation of large language models on benchmark biomedical text processing tasks.
   Comput Biol Med. 2024 Mar;171:108189. doi: 10.1016/j.compbiomed.2024.108189. Epub 2024 Feb 20.
10. BioInstruct: instruction tuning of large language models for biomedical natural language processing.
   J Am Med Inform Assoc. 2024 Sep 1;31(9):1821-1832. doi: 10.1093/jamia/ocae122.

References cited in this article

1. Leveraging generative AI for clinical evidence synthesis needs to ensure trustworthiness.
   J Biomed Inform. 2024 May;153:104640. doi: 10.1016/j.jbi.2024.104640. Epub 2024 Apr 10.
2. GeneGPT: augmenting large language models with domain tools for improved access to biomedical information.
   Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae075.
3. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study.
   Lancet Digit Health. 2024 Jan;6(1):e12-e22. doi: 10.1016/S2589-7500(23)00225-X.
4. Evaluating large language models on medical evidence summarization.
   NPJ Digit Med. 2023 Aug 24;6(1):158. doi: 10.1038/s41746-023-00896-7.
5. Large language models encode clinical knowledge.
   Nature. 2023 Aug;620(7972):172-180. doi: 10.1038/s41586-023-06291-2. Epub 2023 Jul 12.
6. AI-generated text may have a role in evidence-based medicine.
   Nat Med. 2023 Jul;29(7):1593-1594. doi: 10.1038/s41591-023-02366-9.
7. Sleep-like unsupervised replay reduces catastrophic forgetting in artificial neural networks.
   Nat Commun. 2022 Dec 15;13(1):7742. doi: 10.1038/s41467-022-34938-7.
8. Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization.
   AMIA Jt Summits Transl Sci Proc. 2021 May 17;2021:605-614. eCollection 2021.
9. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews.
   Rev Esp Cardiol (Engl Ed). 2021 Sep;74(9):790-799. doi: 10.1016/j.rec.2021.07.010.
10. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry.
   BMJ Open. 2017 Feb 27;7(2):e012545. doi: 10.1136/bmjopen-2016-012545.