
An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study.

Authors

Sivarajkumar Sonish, Kelley Mark, Samolyk-Mazzanti Alyssa, Visweswaran Shyam, Wang Yanshan

Affiliations

Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States.

Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, United States.

Publication

JMIR Med Inform. 2024 Apr 8;12:e55318. doi: 10.2196/55318.

Abstract

BACKGROUND

Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches.

OBJECTIVE

The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types (heuristic and ensemble prompts), for zero-shot and few-shot clinical information extraction using pretrained language models.

METHODS

This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches.
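The prompt types named above can be made concrete with a short sketch. The templates below are purely illustrative, written for the clinical sense disambiguation task (expanding an ambiguous abbreviation such as "PT"); the study's actual prompt wording is not reproduced here, and the note text is invented.

```python
# Hypothetical zero-shot prompt templates for clinical sense disambiguation,
# one per prompt type named in the study. Wording is illustrative only.

def build_prompt(style: str, note: str, abbrev: str) -> str:
    """Return an illustrative zero-shot prompt of the given style."""
    templates = {
        # Simple prefix: a task instruction prepended to the input text.
        "simple_prefix": (
            f"Expand the clinical abbreviation '{abbrev}' in the note below.\n"
            f"Note: {note}\nExpansion:"
        ),
        # Simple cloze: the model fills in a masked slot.
        "simple_cloze": (
            f"Note: {note}\nIn this note, '{abbrev}' stands for ___."
        ),
        # Chain of thought: ask for step-by-step reasoning before the answer.
        "chain_of_thought": (
            f"Note: {note}\nThink step by step about the clinical context, "
            f"then state what '{abbrev}' means here."
        ),
        # Heuristic: encode a domain rule of thumb in the instruction.
        "heuristic": (
            f"Abbreviations in clinical notes usually match their section "
            f"(labs, medications, therapy).\nNote: {note}\n"
            f"Using that heuristic, expand '{abbrev}':"
        ),
    }
    return templates[style]


note = "Pt seen by PT for gait training after hip replacement."
print(build_prompt("heuristic", note, "PT"))
```

In a few-shot variant, one or more worked examples would simply be prepended to the same template before the target note.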

RESULTS

The study revealed that task-specific prompt tailoring is vital for the high performance of LLMs for zero-shot clinical NLP. With heuristic prompts, GPT-3.5 achieved an accuracy of 0.96 in clinical sense disambiguation and 0.94 in biomedical evidence extraction. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on multiple prompt strengths. GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types.
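The ensemble idea, combining the strengths of multiple prompts, can be sketched as a majority vote over the answers produced by the different prompt types. The helper and the example votes below are illustrative assumptions, not the study's implementation.

```python
# Minimal sketch of a prompt-ensemble: collect one answer per prompt type,
# then take the majority vote as the final prediction.
from collections import Counter

def ensemble_vote(predictions: list[str]) -> str:
    """Return the most common answer across prompt types."""
    return Counter(predictions).most_common(1)[0][0]

# e.g., answers from simple-prefix, cloze, and chain-of-thought prompts
# for the abbreviation "PT" in a physiotherapy note:
votes = ["physical therapy", "prothrombin time", "physical therapy"]
print(ensemble_vote(votes))  # physical therapy
```

More elaborate ensembles could weight each prompt type by its validation accuracy rather than voting uniformly.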

CONCLUSIONS

This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research, facilitating engagement by non-NLP experts in clinical NLP advancements. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative artificial intelligence, and we hope that it will inspire and inform future research in this area.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/540f/11036183/be03bb1bfc0e/medinform_v12i1e55318_fig1.jpg