Zhu Zhichao, Zhao Qing, Li Jianjiang, Ge Yanhu, Ding Xingjian, Gu Tao, Zou Jingchen, Lv Sirui, Wang Sheng, Yang Ji-Jiang
College of Computer Science, Beijing University of Technology, Beijing 100124, China.
Department of Anesthesiology, Beijing Anzhen Hospital, Capital Medical University, Beijing 100013, China.
Bioengineering (Basel). 2024 Sep 29;11(10):982. doi: 10.3390/bioengineering11100982.
The emergence of large language models (LLMs) has provided robust support for application tasks across various domains, such as named entity recognition (NER) in the general domain. However, owing to the particularity of the medical domain, research on understanding and improving the effectiveness of LLMs on biomedical named entity recognition (BNER) tasks remains relatively limited, especially for Chinese text. In this study, we extensively evaluate several typical LLMs, including ChatGLM2-6B, GLM-130B, GPT-3.5, and GPT-4, on the Chinese BNER task using a real-world Chinese electronic medical record (EMR) dataset and a public dataset. The experimental results demonstrate the promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for Chinese BNER tasks. More importantly, instruction fine-tuning significantly enhances the performance of LLMs: the fine-tuned offline ChatGLM2-6B surpassed the task-specific BiLSTM+CRF (BC) model on the real-world dataset, and the best fine-tuned model, GPT-3.5, outperformed all other LLMs on the public CCKS2017 dataset, even surpassing half of the baselines; however, it still falls short of the state-of-the-art task-specific model, the Dictionary-guided Attention Network (DGAN). To our knowledge, this study is the first to evaluate the performance of LLMs on Chinese BNER tasks, and it highlights the promising and transformative implications of applying LLMs to Chinese BNER. Furthermore, we distill our findings into a set of actionable guidelines for future researchers on how to effectively turn LLMs into experts on specific tasks.
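As a concrete illustration of the few-shot prompt designs the abstract refers to, below is a minimal sketch of prompting a GPT-3.5-class model for Chinese BNER. The entity categories, demonstration example, and prompt wording are illustrative assumptions (the abstract does not specify them), and the OpenAI chat completions API is assumed as the interface; the paper's actual prompts and schema may differ.

```python
# Minimal sketch of a few-shot Chinese BNER prompt via the OpenAI chat API.
# The entity schema and the demonstration below are illustrative, not the paper's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_EXAMPLES = [
    # (sentence, expected JSON output) -- hypothetical demonstration pair
    ("患者因胸痛入院,心电图示心肌缺血。",
     '{"疾病和诊断": ["心肌缺血"], "症状和体征": ["胸痛"], "检查和检验": ["心电图"]}'),
]

def build_messages(sentence: str) -> list[dict]:
    """Assemble a few-shot chat prompt asking the model to extract entities as JSON."""
    messages = [{
        "role": "system",
        "content": ("你是医学命名实体识别助手。请从句子中抽取实体,"
                    "类别包括:疾病和诊断、症状和体征、检查和检验、治疗、身体部位。"
                    "以JSON格式输出。"),
    }]
    # Each demonstration is replayed as a user/assistant turn before the query.
    for text, answer in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": sentence})
    return messages

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,  # deterministic decoding suits extraction tasks
    messages=build_messages("查体见双下肢水肿,给予利尿剂治疗。"),
)
print(response.choices[0].message.content)
```

Dropping the `FEW_SHOT_EXAMPLES` turns from the message list yields the zero-shot variant; the instruction fine-tuning reported in the abstract would instead train the model on many such (instruction, sentence, JSON-answer) triples.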