ISTAR, Instituto Universitário de Lisboa (ISCTE-IUL), 1649-026, Lisbon, Portugal.
Select Data, Anaheim, CA, 92807, USA.
Comput Biol Med. 2024 Nov;182:109233. doi: 10.1016/j.compbiomed.2024.109233. Epub 2024 Oct 2.
Patient medical information often exists in unstructured text containing abbreviations and acronyms that conserve time and space but pose challenges for automated interpretation. Leveraging the efficacy of Transformers in natural language processing, our objective was to build on the knowledge acquired by an existing language model and continue its pre-training to develop a European Portuguese (PT-PT) healthcare-domain language model.
After a filtering process, we selected Albertina PT-PT 900M as our base language model and continued its pre-training on more than 2.6 million electronic medical records from Portugal's largest public hospital. MediAlbertina 900M was created through domain adaptation on these data using masked language modelling.
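The domain adaptation described here is standard continued pre-training with a masked-language-modelling objective. The sketch below illustrates the approach using the Hugging Face transformers and datasets libraries; the base-model ID, corpus path, and hyperparameters are illustrative placeholders, not the authors' exact configuration.

# Minimal sketch of continued pre-training (domain adaptation) via masked
# language modelling. Model ID and data file are hypothetical placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "PORTULAN/albertina-ptpt"  # assumed ID for Albertina PT-PT 900M
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Plain-text corpus of de-identified electronic medical records (placeholder).
records = load_dataset("text", data_files={"train": "emr_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = records.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens; the model learns to restore them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="medialbertina", num_train_epochs=1,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()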
We compared against this baseline using perplexity, which decreased from about 20 to 1.6, and by fine-tuning and evaluating information extraction models for Named Entity Recognition and Assertion Status. MediAlbertina PT-PT outperformed Albertina PT-PT on both tasks by 4-6% in recall and F1-score.
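For a masked language model, perplexity is commonly reported as pseudo-perplexity: each token is masked in turn, scored by the model, and the mean negative log-likelihood is exponentiated. The abstract does not specify the authors' exact procedure, so the sketch below shows one standard formulation under that assumption.

# Hedged sketch of pseudo-perplexity for a masked language model; this is one
# common formulation, not necessarily the authors' implementation.
import torch

def pseudo_perplexity(text, model, tokenizer):
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    nlls = []
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS]/[SEP] specials
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id   # mask one token at a time
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[input_ids[i]].item())
    # exp(mean NLL) over all masked positions
    return float(torch.exp(torch.tensor(nlls).mean()))

# Example call (Portuguese clinical sentence is illustrative):
# pseudo_perplexity("Doente com dispneia.", model, tokenizer)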
This study contributes the first publicly available medical language model trained on PT-PT data. It underscores the efficacy of domain adaptation and helps the scientific community overcome the obstacles posed by non-English languages. By fine-tuning MediAlbertina on PT-PT medical tasks, further steps can be taken to assist physicians, for example by creating decision support systems or building medical timelines for patient profiling.
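As a concrete example of the fine-tuning mentioned above, the sketch below adapts the domain-adapted encoder for Named Entity Recognition via token classification. The checkpoint path, tag set, and one-sentence corpus are illustrative assumptions; a real setup would use a large annotated set of medical records.

# Minimal sketch of NER fine-tuning; checkpoint, labels, and data are
# hypothetical placeholders, not the authors' task definition.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-PROBLEM", "I-PROBLEM"]  # illustrative BIO tag set
tokenizer = AutoTokenizer.from_pretrained("medialbertina")  # MLM-step output
model = AutoModelForTokenClassification.from_pretrained(
    "medialbertina", num_labels=len(labels))

# Tiny illustrative corpus in BIO format.
raw = Dataset.from_dict({
    "tokens": [["Doente", "com", "pneumonia", "grave"]],
    "ner_tags": [[0, 0, 1, 2]],
})

def align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        # Label the first subword of each word; ignore the rest (-100).
        prev, row = None, []
        for wid in enc.word_ids(batch_index=i):
            row.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        enc["labels"].append(row)
    return enc

train = raw.map(align, batched=True, remove_columns=["tokens", "ner_tags"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="medialbertina-ner", num_train_epochs=3),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()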