Fu Yujuan Velvin, Ramachandran Giridhar Kaushik, Park Namu, Lybarger Kevin, Xia Fei, Uzuner Ozlem, Yetisgen Meliha
University of Washington, Seattle, WA, USA.
George Mason University, Fairfax, VA, USA.
AMIA Jt Summits Transl Sci Proc. 2025 Jun 10;2025:149-158. eCollection 2025.
Large language models (LLMs) such as ChatGPT are fine-tuned on large and diverse instruction-following corpora and can generalize to new tasks. However, these instruction-tuned LLMs often perform poorly on specialized medical natural language understanding (NLU) tasks that require domain knowledge, granular text comprehension, and structured data extraction. To bridge this gap, we: (1) propose a unified prompting format for 7 important NLU tasks, (2) curate an instruction-tuning dataset, MNLU-Instruct, from diverse existing open-source medical NLU corpora, and (3) develop BioMistral-NLU, a generalizable medical NLU model, by fine-tuning BioMistral on MNLU-Instruct. We evaluate BioMistral-NLU in a zero-shot setting on 6 important NLU tasks drawn from two widely adopted medical NLU benchmarks, BLUE and BLURB. Our experiments show that BioMistral-NLU outperforms the original BioMistral as well as the proprietary LLMs ChatGPT and GPT-4. Our dataset-agnostic prompting strategy and instruction tuning over diverse NLU tasks improve LLMs' generalizability across medical NLU tasks. Our ablation experiments show that instruction tuning on a wider variety of tasks, even when the total number of training instances is held constant, enhances downstream zero-shot generalization.
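The sketch below illustrates what a dataset-agnostic, instruction-style prompt for one medical NLU task (named entity recognition) might look like. The template wording, field names, and the build_prompt function are hypothetical assumptions for illustration; the abstract does not specify the exact MNLU-Instruct prompting format.

```python
# Hypothetical sketch of a unified, dataset-agnostic instruction prompt for a
# medical NLU task (here, disease NER). The template and field names are
# illustrative assumptions, not the paper's actual MNLU-Instruct format.

def build_prompt(task_definition: str, label_set: list[str], text: str) -> str:
    """Compose one instruction-style prompt from task metadata and input text."""
    labels = ", ".join(label_set)
    return (
        f"Task: {task_definition}\n"
        f"Allowed labels: {labels}\n"
        f"Input: {text}\n"
        "Output: list each extracted span followed by its label, one per line."
    )

if __name__ == "__main__":
    prompt = build_prompt(
        task_definition="Extract disease mentions from the biomedical sentence.",
        label_set=["Disease"],
        text="The patient was diagnosed with type 2 diabetes and hypertension.",
    )
    print(prompt)
```

Because the task definition and label set are passed in as arguments rather than hard-coded, the same template can in principle cover different corpora and task types, which is the spirit of the unified prompting format described above.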