McWilliams School of Biomedical Informatics, Houston, TX, United States.
Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, United States.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1812-1820. doi: 10.1093/jamia/ocad259.
Importance: The study highlights the potential of large language models, specifically GPT-3.5 and GPT-4, in processing complex clinical data and extracting meaningful information with minimal training data. By developing and refining prompt-based strategies, we can significantly enhance the models' performance, making them viable tools for clinical NER tasks and possibly reducing the reliance on extensive annotated datasets.
Objective: This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks and proposes task-specific prompts to improve their performance.
Materials and Methods: We evaluated these models on 2 clinical NER tasks: (1) to extract medical problems, treatments, and tests from clinical notes in the MTSamples corpus, following the 2010 i2b2 concept extraction shared task, and (2) to identify nervous system disorder-related adverse events from safety reports in the vaccine adverse event reporting system (VAERS). To improve the GPT models' performance, we developed a clinical task-specific prompt framework that includes (1) baseline prompts with task description and format specification, (2) annotation guideline-based prompts, (3) error analysis-based instructions, and (4) annotated samples for few-shot learning. We assessed each prompt's effectiveness and compared the models to BioClinicalBERT.
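The 4-component prompt framework described above can be sketched as a simple composition function. This is a hypothetical illustration, not the study's actual prompts: the component texts, function name, and output format are placeholders chosen for clarity.

```python
# Hypothetical sketch of assembling a clinical NER prompt from the
# 4 components described in the abstract. All strings are illustrative
# placeholders, not the prompts used in the study.
def build_prompt(note_text, guidelines=None, error_instructions=None, examples=None):
    """Compose a prompt from optional task-specific components."""
    parts = [
        # (1) baseline prompt: task description + output format specification
        "Task: extract medical problems, treatments, and tests from the "
        "clinical note below. Return one entity per line as <entity>\t<type>.",
    ]
    if guidelines:
        # (2) annotation guideline-based prompt
        parts.append("Annotation guidelines:\n" + guidelines)
    if error_instructions:
        # (3) error analysis-based instructions
        parts.append("Common mistakes to avoid:\n" + error_instructions)
    if examples:
        # (4) annotated samples for few-shot learning: (note, entities) pairs
        shots = "\n\n".join(
            f"Note:\n{ex_note}\nEntities:\n{ex_out}" for ex_note, ex_out in examples
        )
        parts.append("Examples:\n" + shots)
    parts.append("Note:\n" + note_text + "\nEntities:")
    return "\n\n".join(parts)
```

Because every component beyond the baseline is optional, the same function can generate all the prompt ablations the study compares.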
Results: Using baseline prompts, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.634 and 0.804 for MTSamples and 0.301 and 0.593 for VAERS. Additional prompt components consistently improved model performance. When all 4 components were used, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.794 and 0.861 for MTSamples and 0.676 and 0.736 for VAERS, demonstrating the effectiveness of our prompt framework. Although these results trail BioClinicalBERT (F1 of 0.901 for MTSamples and 0.802 for VAERS), they are very promising considering that few training samples are needed.
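A relaxed F1 score of the kind reported above typically counts a predicted entity as correct when its span overlaps a gold span of the same type, rather than requiring exact boundaries. The following is a minimal sketch under that assumption; the exact matching criterion used in the study may differ.

```python
# Sketch of a relaxed-match F1 computation for NER, assuming a match is
# any same-type pair of spans with overlapping character offsets.
# gold and pred are lists of (start, end, entity_type) tuples.
def relaxed_f1(gold, pred):
    def overlaps(a, b):
        # Half-open spans [start, end) of the same type that overlap
        return a[0] < b[1] and b[0] < a[1] and a[2] == b[2]

    # Precision: fraction of predictions touching some gold entity
    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    # Recall: fraction of gold entities touched by some prediction
    tp_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under relaxed matching, a prediction of "severe headache" against a gold annotation of "headache" still scores as a true positive, which is why relaxed F1 runs higher than exact-match F1 on the same outputs.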
Discussion: The study's findings suggest a promising direction in leveraging LLMs for clinical NER tasks. However, while the performance of GPT models improved with task-specific prompts, there is a need for further development and refinement. LLMs like GPT-4 show potential to approach the performance of state-of-the-art models like BioClinicalBERT, but they still require careful prompt engineering and an understanding of task-specific knowledge. The study also underscores the importance of evaluation schemas that accurately reflect the capabilities and performance of LLMs in clinical settings.
Conclusion: While direct application of GPT models to clinical NER tasks falls short of optimal performance, our task-specific prompt framework, incorporating medical knowledge and training samples, significantly enhances the GPT models' feasibility for potential clinical applications.