Dietrich Jürgen, Hollstein André
Pharmaceuticals, Medical Affairs and Pharmacovigilance, Data Science and Insights, Bayer AG, Müllerstr. 178, 13353, Berlin, Germany.
Drug Saf. 2025 Mar;48(3):287-303. doi: 10.1007/s40264-024-01499-1. Epub 2024 Dec 11.
Recent advances in artificial intelligence (AI) have produced models that generate human-like responses to a wide range of queries, making them useful tools for healthcare applications. The potential use of large language models (LLMs) in controlled environments is therefore of paramount interest with respect to efficacy, reproducibility, and operability.
We investigated whether and how GPT-3.5 and GPT-4 models can be used directly as part of a GxP-validated system and compared the performance of the externally hosted GPT-3.5 and GPT-4 against LLMs that can be hosted internally. We explored zero-shot LLM performance on named entity recognition (NER) and relation extraction tasks, investigated which LLM shows the best zero-shot performance and could therefore be used to generate training data proposals, evaluated zero-shot LLM performance on seven entity types for medical NER, selected one model for further performance improvement (few-shot learning and fine-tuning: Zephyr-7b-beta), and examined how smaller open-source LLMs perform in contrast to the GPT models and to a small fine-tuned T5 Base.
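To make the zero-shot NER setup concrete, the following is a minimal sketch assuming the OpenAI Python client; the model name, entity labels, and prompt wording are illustrative and do not reproduce the study's exact configuration or its seven medical entity types.

```python
from openai import OpenAI
import json

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative entity labels; not the study's seven medical NER entities.
ENTITIES = ["drug", "adverse_event", "dose", "indication"]

def zero_shot_ner(text: str) -> dict:
    """Ask the model to extract entities without any in-context examples."""
    prompt = (
        "Extract the following entity types from the text and return a JSON "
        f"object mapping each type to a list of text spans: {', '.join(ENTITIES)}.\n\n"
        f"Text: {text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},  # request parseable JSON
    )
    return json.loads(resp.choices[0].message.content)

print(zero_shot_ner("Patient developed a rash after taking 500 mg amoxicillin."))
```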
We performed reproducibility experiments to evaluate whether LLMs can be used in controlled environments and used guided generation to apply the same prompt across multiple models. Few-shot learning and quantized low-rank adapter (QLoRA) fine-tuning were applied to further improve LLM performance.
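For the QLoRA step, the following is a minimal sketch of a 4-bit quantized LoRA setup for Zephyr-7b-beta, assuming the Hugging Face transformers/peft/bitsandbytes stack; the LoRA hyperparameters shown are illustrative, not the study's reported configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "HuggingFaceH4/zephyr-7b-beta"

# 4-bit NF4 quantization of the frozen base weights: the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)  # used to tokenize training data
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

# Small low-rank adapters are the only trainable parameters.
lora_config = LoraConfig(
    r=16,                 # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

Because the base weights stay frozen and quantized, this kind of setup allows a 7B-parameter model to be fine-tuned on a single GPU, which is what makes internally hosted open-source models practical alternatives to externally hosted GPT.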
We demonstrated that zero-shot GPT-4 performance is comparable to that of a fine-tuned T5 and that Zephyr performed better than zero-shot GPT-3.5, but recognition of product combinations, such as product-event combinations, was significantly better with the fine-tuned T5. Although OpenAI recently launched GPT versions intended to improve the generation of consistent output, both GPT variants failed to demonstrate reproducible results. This lack of reproducibility, together with the limitations of externally hosted systems in keeping validated systems in a state of control, may affect the use of closed, proprietary models in regulated environments. However, given the good NER performance, we recommend using GPT to create annotation proposals for training data as a basis for fine-tuning.
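A reproducibility check of the kind reported above can be sketched as follows, assuming the OpenAI Python client; the prompt and model name are illustrative. Note that even with greedy decoding and a fixed seed, the API documents only best-effort determinism, so identical outputs across runs are not guaranteed.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Extract all drug names from: 'Patient received aspirin and ibuprofen.'"

def sample(n: int = 5) -> list[str]:
    """Issue the identical request n times with deterministic settings."""
    outputs = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",  # illustrative model name
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0,  # greedy decoding
            seed=42,        # the API treats the seed as best effort only
        )
        outputs.append(resp.choices[0].message.content)
    return outputs

runs = sample()
print("identical across runs:", len(set(runs)) == 1)
```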