


Improving large language models for clinical named entity recognition via prompt engineering.

Author Affiliations

McWilliams School of Biomedical Informatics, Houston, TX, United States.

Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, United States.

Publication Info

J Am Med Inform Assoc. 2024 Sep 1;31(9):1812-1820. doi: 10.1093/jamia/ocad259.

PMID: 38281112
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11339492/
Abstract

IMPORTANCE

The study highlights the potential of large language models, specifically GPT-3.5 and GPT-4, in processing complex clinical data and extracting meaningful information with minimal training data. By developing and refining prompt-based strategies, we can significantly enhance the models' performance, making them viable tools for clinical NER tasks and possibly reducing the reliance on extensive annotated datasets.

OBJECTIVES

This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks and proposes task-specific prompts to improve their performance.

MATERIALS AND METHODS

We evaluated these models on 2 clinical NER tasks: (1) to extract medical problems, treatments, and tests from clinical notes in the MTSamples corpus, following the 2010 i2b2 concept extraction shared task, and (2) to identify nervous system disorder-related adverse events from safety reports in the vaccine adverse event reporting system (VAERS). To improve the GPT models' performance, we developed a clinical task-specific prompt framework that includes (1) baseline prompts with task description and format specification, (2) annotation guideline-based prompts, (3) error analysis-based instructions, and (4) annotated samples for few-shot learning. We assessed each prompt's effectiveness and compared the models to BioClinicalBERT.
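The four prompt components above can be sketched as a simple prompt-assembly function. This is a hypothetical illustration of the framework's structure, not the authors' actual prompts; the function name and all component texts are assumptions.

```python
# Hypothetical sketch of a four-component prompt framework for clinical NER,
# in the spirit of the one described above. Component texts are placeholders.

def build_prompt(note_text,
                 task_description,
                 format_spec,
                 guideline_rules=None,
                 error_instructions=None,
                 few_shot_examples=None):
    """Assemble a prompt from: (1) baseline task description and output
    format, (2) annotation-guideline rules, (3) error-analysis-based
    instructions, and (4) annotated examples for few-shot learning."""
    parts = [task_description, format_spec]
    if guideline_rules:
        parts.append("Annotation guidelines:\n" + "\n".join(guideline_rules))
    if error_instructions:
        parts.append("Avoid these common errors:\n" + "\n".join(error_instructions))
    for ex in few_shot_examples or []:
        parts.append(f"Input: {ex['text']}\nOutput: {ex['entities']}")
    parts.append(f"Input: {note_text}\nOutput:")
    return "\n\n".join(parts)
```

Because the components are optional arguments, each one can be toggled on independently, which mirrors how the study assessed the contribution of each prompt component in isolation and in combination.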

RESULTS

Using baseline prompts, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.634 and 0.804 on MTSamples, and 0.301 and 0.593 on VAERS. Additional prompt components consistently improved model performance. When all 4 components were used, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.794 and 0.861 on MTSamples, and 0.676 and 0.736 on VAERS, demonstrating the effectiveness of our prompt framework. Although these results trail BioClinicalBERT (F1 of 0.901 on MTSamples and 0.802 on VAERS), they are very promising considering that only a few training samples are needed.
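The relaxed F1 reported above scores a prediction as correct when it overlaps a gold entity of the same type, rather than requiring exact span boundaries. A minimal sketch, under the assumption that entities are (start, end, type) tuples with exclusive end offsets:

```python
def relaxed_f1(gold, pred):
    """Relaxed (overlap-based) F1: a predicted span matches if it overlaps
    any gold span of the same entity type, and vice versa for recall."""
    def overlaps(a, b):
        # Same type, and the half-open intervals [start, end) intersect.
        return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

    tp_pred = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
    tp_gold = sum(1 for g in gold if any(overlaps(g, p) for p in pred))
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Counting precision over predictions and recall over gold entities separately avoids double-counting when one predicted span overlaps several gold spans.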

DISCUSSION

The study's findings suggest a promising direction for leveraging LLMs in clinical NER tasks. However, while the performance of GPT models improved with task-specific prompts, further development and refinement are needed. LLMs such as GPT-4 show potential to approach the performance of state-of-the-art models like BioClinicalBERT, but they still require careful prompt engineering and an understanding of task-specific knowledge. The study also underscores the importance of evaluation schemas that accurately reflect the capabilities and performance of LLMs in clinical settings.

CONCLUSION

While direct application of GPT models to clinical NER tasks falls short of optimal performance, our task-specific prompt framework, incorporating medical knowledge and training samples, significantly enhances GPT models' feasibility for potential clinical applications.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6617/11339492/de1a61e558d1/ocad259f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6617/11339492/bff2c4079187/ocad259f2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6617/11339492/03cbdbca2764/ocad259f3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6617/11339492/df29b0e85207/ocad259f4.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6617/11339492/babf104ee5f0/ocad259f5.jpg

Similar Articles

1. Improving large language models for clinical named entity recognition via prompt engineering.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1812-1820. doi: 10.1093/jamia/ocad259.

2. Improving entity recognition using ensembles of deep learning and fine-tuned large language models: A case study on adverse event extraction from VAERS and social media.
J Biomed Inform. 2025 Mar;163:104789. doi: 10.1016/j.jbi.2025.104789. Epub 2025 Feb 7.

3. Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study.
J Med Internet Res. 2025 Mar 18;27:e66279. doi: 10.2196/66279.

4. Prompt Framework for Extracting Scale-Related Knowledge Entities from Chinese Medical Literature: Development and Evaluation Study.
J Med Internet Res. 2025 Mar 18;27:e67033. doi: 10.2196/67033.

5. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study.
JMIR Med Inform. 2024 Apr 8;12:e55318. doi: 10.2196/55318.

6. Performance and Reproducibility of Large Language Models in Named Entity Recognition: Considerations for the Use in Controlled Environments.
Drug Saf. 2025 Mar;48(3):287-303. doi: 10.1007/s40264-024-01499-1. Epub 2024 Dec 11.

7. A comparative study of large language model-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports.
J Am Med Inform Assoc. 2024 Oct 1;31(10):2315-2327. doi: 10.1093/jamia/ocae146.

8. An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study.
JMIR Med Inform. 2024 Dec 18;12:e60665. doi: 10.2196/60665.

9. Evaluation of the Performance of a Large Language Model to Extract Signs and Symptoms from Clinical Notes.
Stud Health Technol Inform. 2025 Apr 8;323:71-75. doi: 10.3233/SHTI250051.

10. Relation extraction using large language models: a case study on acupuncture point locations.
J Am Med Inform Assoc. 2024 Nov 1;31(11):2622-2631. doi: 10.1093/jamia/ocae233.

Cited By

1. Exploring the use of large language models for classification, clinical interpretation, and treatment recommendation in breast tumor patient records.
Sci Rep. 2025 Aug 26;15(1):31450. doi: 10.1038/s41598-025-16999-y.

2. Symptom Recognition in Medical Conversations via Multi-Instance Learning and Prompt.
J Med Syst. 2025 Aug 20;49(1):107. doi: 10.1007/s10916-025-02240-w.

3. Large Language Models for Adverse Drug Events: A Clinical Perspective.
J Clin Med. 2025 Aug 4;14(15):5490. doi: 10.3390/jcm14155490.

4. Do LLMs Surpass Encoders for Biomedical NER?
Proc (IEEE Int Conf Healthc Inform). 2025 Jun;2025:352-358. doi: 10.1109/ICHI64645.2025.00048. Epub 2025 Jul 22.

5. Multi-Label Classification with Generative AI Models in Healthcare: A Case Study of Suicidality and Risk Factors.
ArXiv. 2025 Jul 22:arXiv:2507.17009v1.

6. A Large-Language Model Framework for Relative Timeline Extraction from PubMed Case Reports.
ArXiv. 2025 Apr 15:arXiv:2504.12350v1.

7. Accuracy of Large Language Models to Identify Stroke Subtypes Within Unstructured Electronic Health Record Data.
Stroke. 2025 Jul 25. doi: 10.1161/STROKEAHA.125.051993.

8. Advancing named entity recognition in interprofessional collaboration and education.
Front Med (Lausanne). 2025 Jun 26;12:1578769. doi: 10.3389/fmed.2025.1578769. eCollection 2025.

9. Toward Cross-Hospital Deployment of Natural Language Processing Systems: Model Development and Validation of Fine-Tuned Large Language Models for Disease Name Recognition in Japanese.
JMIR Med Inform. 2025 Jul 8;13:e76773. doi: 10.2196/76773.

10. Academic case reports lack diversity: Assessing the presence and diversity of sociodemographic and behavioral factors related to Post COVID-19 Condition.
PLoS One. 2025 Jul 2;20(7):e0326668. doi: 10.1371/journal.pone.0326668. eCollection 2025.

References

1. GeneGPT: augmenting large language models with domain tools for improved access to biomedical information.
Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae075.

2. Opportunities and challenges for ChatGPT and large language models in biomedicine and health.
Brief Bioinform. 2023 Nov 22;25(1). doi: 10.1093/bib/bbad493.

3. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports.
Eur Radiol. 2024 May;34(5):2817-2825. doi: 10.1007/s00330-023-10213-1. Epub 2023 Oct 5.

4. Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot.
J Am Coll Radiol. 2023 Oct;20(10):990-997. doi: 10.1016/j.jacr.2023.05.003. Epub 2023 Jun 21.

5. Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings.
Ophthalmol Sci. 2023 May 5;3(4):100324. doi: 10.1016/j.xops.2023.100324. eCollection 2023 Dec.

6. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models.
PLOS Digit Health. 2023 Feb 9;2(2):e0000198. doi: 10.1371/journal.pdig.0000198. eCollection 2023 Feb.

7. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.
JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.

8. Extracting postmarketing adverse events from safety reports in the vaccine adverse event reporting system (VAERS) using deep learning.
J Am Med Inform Assoc. 2021 Jul 14;28(7):1393-1400. doi: 10.1093/jamia/ocab014.

9. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.

10. Clinical Natural Language Processing in languages other than English: opportunities and challenges.
J Biomed Semantics. 2018 Mar 30;9(1):12. doi: 10.1186/s13326-018-0179-8.