MediAlbertina: An European Portuguese medical language model.

Affiliations

ISTAR, Instituto Universitário de Lisboa (ISCTE-IUL), 1649-026, Lisbon, Portugal.

Select Data, Anaheim, CA, 92807, USA.

Publication information

Comput Biol Med. 2024 Nov;182:109233. doi: 10.1016/j.compbiomed.2024.109233. Epub 2024 Oct 2.

DOI:10.1016/j.compbiomed.2024.109233

PMID:39362002

Abstract

BACKGROUND

Patient medical information often exists as unstructured text containing abbreviations and acronyms that are deemed essential for saving time and space but that pose challenges for automated interpretation. Building on the effectiveness of Transformers in natural language processing, our objective was to take the knowledge already acquired by a language model and continue its pre-training to develop a European Portuguese (PT-PT) healthcare-domain language model.

METHODS

After carrying out a filtering process, we selected Albertina PT-PT 900M as our base language model and continued its pre-training on more than 2.6 million electronic medical records from Portugal's largest public hospital. MediAlbertina 900M was created through domain adaptation on this data using masked language modelling.
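To make the masked language modelling objective concrete, here is a minimal pure-Python sketch of how training examples can be built from a tokenized sentence. It assumes the common BERT-style recipe (select about 15% of tokens; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged). The abstract does not state which masking settings were used for MediAlbertina, so the probabilities, the `[MASK]` string, the function name `mask_tokens`, and the example sentence are all illustrative assumptions.

```python
import random

MASK_TOKEN = "[MASK]"   # assumed mask symbol; real tokenizers define their own
IGNORE = None           # label for positions the loss should skip


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for one masked-language-modelling example.

    Assumed BERT-style recipe, not taken from the paper: each token is
    selected with probability `mask_prob`; a selected token becomes
    MASK_TOKEN 80% of the time, a random vocabulary token 10% of the
    time, and stays unchanged 10% of the time. Labels hold the original
    token at selected positions and IGNORE everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)            # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_TOKEN)
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)         # position not scored
    return inputs, labels


# Invented example sentence with clinical abbreviations (not from the dataset).
sentence = "doente com HTA e DM2 medicado com metformina".split()
vocabulary = sorted(set(sentence))
masked_inputs, target_labels = mask_tokens(sentence, vocabulary, mask_prob=0.3)
```

During continued pre-training, the model sees `masked_inputs` and is penalized only at the positions where `target_labels` is not `IGNORE`, which is how it adapts to domain vocabulary without any manual annotation.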

RESULTS

We compared MediAlbertina with our baseline using perplexity, which decreased from about 20 to 1.6, and by fine-tuning and evaluating information extraction models such as Named Entity Recognition and Assertion Status. MediAlbertina PT-PT outperformed Albertina PT-PT in both tasks by 4-6% in recall and F1-score.
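As a reading aid for the perplexity figures, the sketch below computes perplexity as the exponential of the mean negative log-probability assigned to the correct tokens. The abstract does not describe exactly how perplexity was measured (for a masked model it is typically computed over masked positions only), so the function name `perplexity` and the toy probabilities are assumptions for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the correct tokens).

    `token_probs` holds the probability the model assigned to each
    correct token; every value must lie in (0, 1].
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that gives every correct token probability 1/20 has a
# perplexity of about 20: it is as uncertain as a uniform choice
# among 20 candidates.
baseline_like = perplexity([1 / 20] * 8)

# A model that gives every correct token probability 1/1.6 (0.625)
# has a perplexity of about 1.6.
adapted_like = perplexity([1 / 1.6] * 8)
```

Read this way, a drop from about 20 to 1.6 means the adapted model's uncertainty about a hidden token shrinks from roughly twenty equally likely candidates to fewer than two.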

CONCLUSIONS

This study contributes the first publicly available medical language model trained on PT-PT data. It underscores the efficacy of domain adaptation and helps the scientific community overcome the obstacles faced by non-English languages. By fine-tuning MediAlbertina for PT-PT medical tasks, further steps can be taken to assist physicians, for example by creating decision support systems or building medical timelines for profiling.
