Infusing Multi-Hop Medical Knowledge Into Smaller Language Models for Biomedical Question Answering.

Authors

Chen Jing, Wei Zhihua, Shen Wen, Shang Rui

Publication

IEEE J Biomed Health Inform. 2025 Mar 19;PP. doi: 10.1109/JBHI.2025.3547444.

Abstract

MedQA-USMLE is a challenging biomedical question answering (BQA) task, as its questions typically involve multi-hop reasoning. To solve this task, BQA systems must possess substantial medical professional knowledge and strong medical reasoning capabilities. While state-of-the-art larger language models, such as Med-PaLM 2, have overcome this challenge, smaller language models (SLMs) still struggle with it. To bridge this gap, we introduce a multi-hop medical knowledge infusion (MHMKI) procedure to endow SLMs with medical reasoning capabilities. Specifically, we categorize MedQA-USMLE questions into distinct reasoning types, then create pre-training instances tailored to each question type using the semi-structured information and hyperlinks of Wikipedia articles. To enable SLMs to efficiently capture the multi-hop knowledge embedded in these instances, we design a reasoning chain masked language model for further pre-training of BERT models. Moreover, we transform these pre-training instances into a combined question answering dataset for intermediate fine-tuning of GPT models. We evaluate MHMKI with six SLMs (three BERT models and three GPT models) across five datasets spanning three BQA tasks. Results show that MHMKI benefits SLMs in nearly all tasks, especially those requiring multi-hop reasoning. For instance, accuracy on MedQA-USMLE increases by 5.3% on average.
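The core idea of reasoning-chain masked language modeling can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each pre-training instance is a chain of linked Wikipedia passages, and masks the "bridge" entities that connect consecutive hops so that recovering them requires reading across passages. The `build_chain_instance` helper and the instance format are illustrative assumptions.

```python
MASK = "[MASK]"
SEP = "[SEP]"

def build_chain_instance(passages, bridge_entities):
    """Concatenate a multi-hop chain of passages and mask the tokens of the
    bridge entities linking consecutive hops (hypothetical instance format).

    Returns the masked token sequence and (position, original_token) labels,
    analogous to MLM targets in BERT-style pre-training.
    """
    tokens = []
    for passage in passages:
        tokens.extend(passage.split())
        tokens.append(SEP)

    # Lower-cased word set of all bridge entities (e.g. the shared entity
    # that a Wikipedia hyperlink points to in the next hop).
    bridge_words = {w.lower() for e in bridge_entities for w in e.split()}

    masked, labels = [], []
    for i, tok in enumerate(tokens):
        if tok.strip(".,").lower() in bridge_words:
            masked.append(MASK)          # force the model to infer the bridge
            labels.append((i, tok))      # MLM label: recover the original token
        else:
            masked.append(tok)
    return masked, labels
```

Because every occurrence of the bridge entity is masked, the model cannot copy it from an adjacent mention in the same passage; it must combine evidence from both hops, which is the behavior the reasoning-chain objective is meant to encourage.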

