A comprehensive evaluation of large language models on benchmark biomedical text processing tasks.

Affiliations

Department of Biology, York University, Canada; Information Retrieval and Knowledge Management Research Lab, York University, Canada.

School of Information Technology, York University, Canada; Information Retrieval and Knowledge Management Research Lab, York University, Canada; Dialpad Inc., Canada.

Publication information

Comput Biol Med. 2024 Mar;171:108189. doi: 10.1016/j.compbiomed.2024.108189. Epub 2024 Feb 20.

DOI:10.1016/j.compbiomed.2024.108189

PMID:38447502

Abstract

Recently, Large Language Models (LLMs) have demonstrated an impressive capability to solve a wide range of tasks. However, despite their success across various tasks, no prior work has investigated their capability in the biomedical domain. To this end, this paper evaluates the performance of LLMs on benchmark biomedical tasks through a comprehensive evaluation of 4 popular LLMs on 6 diverse biomedical tasks across 26 datasets. To the best of our knowledge, this is the first work to conduct an extensive evaluation and comparison of various LLMs in the biomedical domain. Interestingly, our evaluation shows that on biomedical datasets with smaller training sets, zero-shot LLMs even outperform the current state-of-the-art models when those models are fine-tuned only on the training sets of these datasets. This suggests that pre-training on large text corpora makes LLMs quite specialized even in the biomedical domain. We also find that no single LLM outperforms the others on all tasks; the performance of different LLMs varies depending on the task. While their performance is still quite poor in comparison to biomedical models that were fine-tuned on large training sets, our findings demonstrate that LLMs have the potential to be a valuable tool for various biomedical tasks that lack large annotated data.
