
An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study.

Authors

Sivarajkumar Sonish, Kelley Mark, Samolyk-Mazzanti Alyssa, Visweswaran Shyam, Wang Yanshan

Affiliations

Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States.

Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, United States.

Publication

JMIR Med Inform. 2024 Apr 8;12:e55318. doi: 10.2196/55318.

Abstract

BACKGROUND

Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches.

OBJECTIVE

The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types (heuristic and ensemble prompts), for zero-shot and few-shot clinical information extraction using pretrained language models.

METHODS

This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches.
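The prompt types named above can be made concrete with a short sketch. The templates below are purely illustrative, written for the clinical sense disambiguation task (expanding an ambiguous abbreviation such as "PT"); the study's actual prompt wording is not reproduced here, and the note text is invented.

```python
# Hypothetical zero-shot prompt templates for clinical sense disambiguation,
# one per prompt type named in the study. Wording is illustrative only.

def build_prompt(style: str, note: str, abbrev: str) -> str:
    """Return an illustrative zero-shot prompt of the given style."""
    templates = {
        # Simple prefix: a task instruction prepended to the input text.
        "simple_prefix": (
            f"Expand the clinical abbreviation '{abbrev}' in the note below.\n"
            f"Note: {note}\nExpansion:"
        ),
        # Simple cloze: the model fills in a masked slot.
        "simple_cloze": (
            f"Note: {note}\nIn this note, '{abbrev}' stands for ___."
        ),
        # Chain of thought: ask for step-by-step reasoning before the answer.
        "chain_of_thought": (
            f"Note: {note}\nThink step by step about the clinical context, "
            f"then state what '{abbrev}' means here."
        ),
        # Heuristic: encode a domain rule of thumb in the instruction.
        "heuristic": (
            f"Abbreviations in clinical notes usually match their section "
            f"(labs, medications, therapy).\nNote: {note}\n"
            f"Using that heuristic, expand '{abbrev}':"
        ),
    }
    return templates[style]


note = "Pt seen by PT for gait training after hip replacement."
print(build_prompt("heuristic", note, "PT"))
```

In a few-shot variant, one or more worked examples would simply be prepended to the same template before the target note.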

RESULTS

The study revealed that task-specific prompt tailoring is vital for the high performance of LLMs for zero-shot clinical NLP. With heuristic prompts, GPT-3.5 achieved an accuracy of 0.96 in clinical sense disambiguation and 0.94 in biomedical evidence extraction. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on multiple prompt strengths. GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types.
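The ensemble idea, combining the strengths of multiple prompts, can be sketched as a majority vote over the answers produced by the different prompt types. The helper and the example votes below are illustrative assumptions, not the study's implementation.

```python
# Minimal sketch of a prompt-ensemble: collect one answer per prompt type,
# then take the majority vote as the final prediction.
from collections import Counter

def ensemble_vote(predictions: list[str]) -> str:
    """Return the most common answer across prompt types."""
    return Counter(predictions).most_common(1)[0][0]

# e.g., answers from simple-prefix, cloze, and chain-of-thought prompts
# for the abbreviation "PT" in a physiotherapy note:
votes = ["physical therapy", "prothrombin time", "physical therapy"]
print(ensemble_vote(votes))  # physical therapy
```

More elaborate ensembles could weight each prompt type by its validation accuracy rather than voting uniformly.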

CONCLUSIONS

This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research, facilitating engagement by non-NLP experts in clinical NLP advancements. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative artificial intelligence, and we hope that it will inspire and inform future research in this area.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/540f/11036183/be03bb1bfc0e/medinform_v12i1e55318_fig1.jpg