Towards building multilingual language model for medicine.

Affiliations

Shanghai Jiao Tong University, Shanghai, China.

Shanghai AI Laboratory, Shanghai, China.

Publication information

Nat Commun. 2024 Sep 27;15(1):8384. doi: 10.1038/s41467-024-52417-z.

Abstract

The development of open-source, multilingual medical language models can benefit a wide, linguistically diverse audience from different regions. To promote this domain, we present the following contributions: First, we construct a multilingual medical corpus of approximately 25.5B tokens spanning 6 main languages, termed MMedC, which enables auto-regressive domain adaptation of general LLMs; Second, to monitor the development of multilingual medical LLMs, we propose a multilingual medical multiple-choice question-answering benchmark with rationales, termed MMedBench; Third, we assess a number of open-source large language models (LLMs) on our benchmark, along with those further auto-regressively trained on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, in this work we present a large-scale corpus, a benchmark, and a series of models to support the development of multilingual medical LLMs.
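The key technique the abstract names, auto-regressive domain adaptation, is continued next-token-prediction training of a general LLM on domain text. Below is a minimal sketch using the Hugging Face transformers API; the base checkpoint, corpus file name, and hyperparameters are illustrative assumptions, not the authors' actual configuration, and MMedC itself is not reproduced here.

```python
# Minimal sketch of auto-regressive domain adaptation (continued
# next-token pretraining) on a medical text corpus. The base model,
# corpus file, and hyperparameters below are illustrative assumptions,
# not the authors' actual training configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "meta-llama/Meta-Llama-3-8B"  # general LLM to adapt
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical local text file standing in for a corpus shard;
# MMedC itself is released by the authors and not reproduced here.
corpus = load_dataset("text", data_files={"train": "mmedc_sample.txt"})

def tokenize(batch):
    # Standard causal-LM preprocessing: tokenize and truncate long lines.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_set = corpus["train"].map(tokenize, batched=True,
                                remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mmed-llama3-adapted",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1,
                           learning_rate=2e-5,
                           bf16=True),
    train_dataset=train_set,
    # mlm=False selects the next-token (auto-regressive) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```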

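MMedBench is described as multiple-choice question answering with rationales. A common way to score such a benchmark is to pick the option the model assigns the highest likelihood; the sketch below assumes a simple {question, options, answer} item schema and a hypothetical adapted checkpoint name, neither of which reflects the benchmark's actual format. Benchmark accuracy would then be the fraction of items where the prediction matches the gold answer.

```python
# Hedged sketch of multiple-choice evaluation in the style of MMedBench.
# The checkpoint name and the {question, options, answer} item schema are
# assumptions for illustration; the benchmark's real format, and its
# rationale annotations, are not shown in this record.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mmed-llama3-adapted"  # hypothetical adapted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_loss(question: str, option: str) -> float:
    # Crude proxy score: mean negative log-likelihood of the full
    # "question + option" string; lower loss = more plausible option.
    ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

def predict(item: dict) -> str:
    # Choose the option letter whose text the model scores as most likely.
    losses = {letter: option_loss(item["question"], text)
              for letter, text in item["options"].items()}
    return min(losses, key=losses.get)

# Toy item using the assumed schema (not an actual MMedBench record).
item = {"question": "Which vitamin deficiency causes scurvy?",
        "options": {"A": "Vitamin A", "B": "Vitamin B12",
                    "C": "Vitamin C", "D": "Vitamin D"},
        "answer": "C"}
print(predict(item) == item["answer"])
```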

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5375/11436924/4d6289a09496/41467_2024_52417_Fig1_HTML.jpg
