Xie Qianqian, Chen Qingyu, Chen Aokun, Peng Cheng, Hu Yan, Lin Fongci, Peng Xueqing, Huang Jimin, Zhang Jeffrey, Keloth Vipina, Zhou Xinyu, He Huan, Ohno-Machado Lucila, Wu Yonghui, Xu Hua, Bian Jiang
Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, USA.
Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA.
Res Sq. 2024 May 22:rs.3.rs-4240043. doi: 10.21203/rs.3.rs-4240043/v1.
Recent advancements in large language models (LLMs) such as ChatGPT and LLaMA have hinted at their potential to revolutionize medical applications, yet their use in clinical settings often reveals limitations stemming from a lack of specialized training on medical data. In response to this challenge, this study introduces Me-LLaMA, a novel medical LLM family comprising the foundation models Me-LLaMA 13/70B and their chat-enhanced versions, Me-LLaMA 13/70B-chat, developed through continual pre-training and instruction tuning of LLaMA2 on large medical datasets. Our methodology leverages a comprehensive domain-specific data suite, including a large-scale continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) covering six critical medical tasks across 12 datasets. Our extensive evaluation on MIBE shows that Me-LLaMA models achieve overall better performance than existing open-source medical LLMs in zero-shot, few-shot, and supervised learning settings. With task-specific instruction tuning, Me-LLaMA models outperform ChatGPT on 7 out of 8 datasets and GPT-4 on 5 out of 8 datasets. In addition, we investigated the catastrophic forgetting problem, and our results show that Me-LLaMA models outperform other open-source medical LLMs in mitigating this issue. Me-LLaMA is one of the largest open-source medical foundation LLMs trained on both biomedical and clinical data. It exhibits superior performance across both general and medical tasks compared to other open-source medical LLMs, making it an attractive choice for medical AI applications. We release our models, datasets, and evaluation scripts at: https://github.com/BIDS-Xu-Lab/Me-LLaMA.
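To illustrate the zero-shot evaluation setting described above, the sketch below shows how one might query a Me-LLaMA chat model with Hugging Face transformers. The model identifier and the prompt format are placeholders, not official names from the paper; consult the project repository at https://github.com/BIDS-Xu-Lab/Me-LLaMA for the actual released weights and recommended usage.

```python
# Minimal, hypothetical sketch of zero-shot inference with a Me-LLaMA chat model.
# MODEL_ID is a placeholder, not an official identifier from the Me-LLaMA release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/me-llama-13b-chat"  # placeholder path to downloaded weights

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision to fit a 13B model on a single GPU
    device_map="auto",
)

# A zero-shot medical question, mirroring the kind of task covered by MIBE.
prompt = (
    "You are a medical assistant. Answer the question concisely.\n"
    "Question: What is the first-line pharmacologic treatment for uncomplicated "
    "type 2 diabetes?\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```

The same loop can be adapted for few-shot prompting by prepending worked examples to the prompt, or for supervised fine-tuning by wrapping the model in a standard training loop.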