From pre-training to fine-tuning: An in-depth analysis of Large Language Models in the biomedical domain.

Affiliations

Research Unit of Intelligent Technology for Health and Wellbeing, Department of Engineering, Università Campus Bio-Medico di Roma, Via Alvaro del Portillo, 21, Rome, 00128, Italy; ItaliaNLP Lab, Institute of Computational Linguistics "Antonio Zampolli", National Research Council, Via Giuseppe Moruzzi, 1, Pisa, 56124, Italy.

ItaliaNLP Lab, Institute of Computational Linguistics "Antonio Zampolli", National Research Council, Via Giuseppe Moruzzi, 1, Pisa, 56124, Italy; Research Unit of Computer Systems and Bioinformatics, Department of Engineering, Università Campus Bio-Medico di Roma, Via Alvaro del Portillo, 21, Rome, 00128, Italy.

Publication Information

Artif Intell Med. 2024 Nov;157:103003. doi: 10.1016/j.artmed.2024.103003. Epub 2024 Oct 23.

Abstract

In this study, we delve into the adaptation and effectiveness of Transformer-based, pre-trained Large Language Models (LLMs) within the biomedical domain, a field that poses unique challenges due to its complexity and the specialized nature of its data. Building on the foundation laid by the transformative architecture of Transformers, we investigate the nuanced dynamics of LLMs through a multifaceted lens, focusing on two domain-specific tasks, i.e., Natural Language Inference (NLI) and Named Entity Recognition (NER). Our objective is to bridge the knowledge gap regarding how these models' downstream performances correlate with their capacity to encapsulate task-relevant information. To achieve this goal, we probed and analyzed the inner encoding and attention mechanisms in LLMs, both encoder- and decoder-based, tailored for either general or biomedical-specific applications. This examination occurs before and after the models are fine-tuned across various data volumes. Our findings reveal that the models' downstream effectiveness is intricately linked to specific patterns within their internal mechanisms, shedding light on the nuanced ways in which LLMs process and apply knowledge in the biomedical context. The source code for this paper is available at https://github.com/agnesebonfigli99/LLMs-in-the-Biomedical-Domain.
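The probing analysis summarized above can be illustrated with a minimal sketch: extract hidden states from one layer of a pre-trained encoder and fit a linear probe on token-level entity labels, so that probe accuracy reflects how much task-relevant information that layer encodes. The checkpoint (`bert-base-uncased`), the probed layer, the toy sentences, and the logistic-regression probe are illustrative assumptions, not the authors' exact setup; their full code is in the repository linked above.

```python
# Minimal probing sketch: extract per-token hidden states from one layer of a
# pre-trained encoder and fit a linear probe on entity labels.
# Model name, layer index, data, and probe choice are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "bert-base-uncased"  # assumption: any encoder checkpoint could be probed this way

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Toy sentences with token-level tags (1 = biomedical entity, 0 = other);
# a real probe would use an annotated biomedical NER corpus.
sentences = [
    (["Aspirin", "reduces", "fever"], [1, 0, 0]),
    (["Patients", "received", "ibuprofen", "daily"], [0, 0, 1, 0]),
]

layer_to_probe = 8  # assumption: probe a single intermediate layer

features, labels = [], []
with torch.no_grad():
    for words, tags in sentences:
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        hidden = model(**enc).hidden_states[layer_to_probe][0]  # (seq_len, dim)
        word_ids = enc.word_ids(0)
        seen = set()
        for pos, wid in enumerate(word_ids):
            # Use the first sub-token of each word as its representation.
            if wid is None or wid in seen:
                continue
            seen.add(wid)
            features.append(hidden[pos].numpy())
            labels.append(tags[wid])

# A simple linear probe: high accuracy suggests the layer linearly encodes the distinction.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))
```

The same loop can be repeated over every layer, and before versus after fine-tuning, to trace where task-relevant information accumulates in the network.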

