Summarizing, Simplifying, and Synthesizing Medical Evidence Using GPT-3 (with Varying Success).

Authors

Shaib Chantal, Li Millicent L, Joseph Sebastian, Marshall Iain J, Li Junyi Jessy, Wallace Byron C

Affiliations

Northeastern University.

The University of Texas at Austin.

Publication information

Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:1387-1407. doi: 10.18653/v1/2023.acl-short.119.

DOI: 10.18653/v1/2023.acl-short.119
PMID: 39629494
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11613457/
Abstract

Large language models, particularly GPT-3, are able to produce high-quality summaries of general-domain news articles in few- and zero-shot settings. However, it is unclear if such models are similarly capable in more specialized, high-stakes domains such as biomedicine. In this paper, we enlist domain experts (individuals with medical training) to evaluate summaries of biomedical articles generated by GPT-3, given zero supervision. We consider both single- and multi-document settings. In the former, GPT-3 is tasked with generating regular and plain-language summaries of articles describing randomized controlled trials; in the latter, we assess the degree to which GPT-3 is able to synthesize evidence reported across a collection of articles. We design an annotation scheme for evaluating model outputs, with an emphasis on assessing the factual accuracy of generated summaries. We find that while GPT-3 is able to summarize and simplify single biomedical articles faithfully, it struggles to provide accurate aggregations of findings over multiple documents. We release all data and annotations used in this work.

Similar articles

1. Summarizing, Simplifying, and Synthesizing Medical Evidence Using GPT-3 (with Varying Success).
   Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:1387-1407. doi: 10.18653/v1/2023.acl-short.119.
2. Summarizing Online Patient Conversations Using Generative Language Models: Experimental and Comparative Study.
   JMIR Med Inform. 2025 Apr 14;13:e62909. doi: 10.2196/62909.
3. Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
   Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.
4. Evaluating Large Language Models for Drafting Emergency Department Discharge Summaries.
   medRxiv. 2024 Apr 4:2024.04.03.24305088. doi: 10.1101/2024.04.03.24305088.
5. The unreasonable effectiveness of large language models in zero-shot semantic annotation of legal texts.
   Front Artif Intell. 2023 Nov 17;6:1279794. doi: 10.3389/frai.2023.1279794. eCollection 2023.
6. Evaluating Large Language Models on Medical Evidence Summarization.
   medRxiv. 2023 Apr 24:2023.04.22.23288967. doi: 10.1101/2023.04.22.23288967.
7. Evaluating large language models on medical evidence summarization.
   NPJ Digit Med. 2023 Aug 24;6(1):158. doi: 10.1038/s41746-023-00896-7.
8. Exploring the opportunities of large language models for summarizing palliative care consultations: A pilot comparative study.
   Digit Health. 2024 Nov 20;10:20552076241293932. doi: 10.1177/20552076241293932. eCollection 2024 Jan-Dec.
9. Performance and Reproducibility of Large Language Models in Named Entity Recognition: Considerations for the Use in Controlled Environments.
   Drug Saf. 2025 Mar;48(3):287-303. doi: 10.1007/s40264-024-01499-1. Epub 2024 Dec 11.
10. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study.
    JMIR Med Inform. 2024 Apr 8;12:e55318. doi: 10.2196/55318.

Cited by

1. Scalable Scientific Interest Profiling Using Large Language Models.
   ArXiv. 2025 Aug 19:arXiv:2508.15834v1.
2. Accelerating clinical evidence synthesis with large language models.
   NPJ Digit Med. 2025 Aug 8;8(1):509. doi: 10.1038/s41746-025-01840-7.
3. A foundation model for human-AI collaboration in medical literature mining.
   ArXiv. 2025 Jan 27:arXiv:2501.16255v1.
4. ChatGPT-4o Compared With Human Researchers in Writing Plain-Language Summaries for Cochrane Reviews: A Blinded, Randomized Non-Inferiority Controlled Trial.
   Cochrane Evid Synth Methods. 2025 Jul 28;3(4):e70037. doi: 10.1002/cesm.70037. eCollection 2025 Jul.
5. A perspective for adapting generalist AI to specialized medical AI applications and their challenges.
   NPJ Digit Med. 2025 Jul 11;8(1):429. doi: 10.1038/s41746-025-01789-7.
6. Evaluating generative AI for qualitative data extraction in community-based fisheries management literature.
   Environ Evid. 2025 Jun 2;14(1):9. doi: 10.1186/s13750-025-00362-9.
7. Utilizing Large language models to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation.
   BMC Med Res Methodol. 2025 Apr 28;25(1):116. doi: 10.1186/s12874-025-02569-3.
8. LitSumm: large language models for literature summarization of noncoding RNAs.
   Database (Oxford). 2025 Feb 5;2025. doi: 10.1093/database/baaf006.
9. Evaluation and practical application of prompt-driven ChatGPTs for EMR generation.
   NPJ Digit Med. 2025 Feb 2;8(1):77. doi: 10.1038/s41746-025-01472-x.
10. Demystifying Large Language Models for Medicine: A Primer.
    ArXiv. 2024 Nov 20:arXiv:2410.18856v3.
