
MediAlbertina: An European Portuguese medical language model.

Affiliations

ISTAR, Instituto Universitário de Lisboa (ISCTE-IUL), 1649-026, Lisbon, Portugal.

Select Data, Anaheim, CA, 92807, USA.

Publication information

Comput Biol Med. 2024 Nov;182:109233. doi: 10.1016/j.compbiomed.2024.109233. Epub 2024 Oct 2.

DOI:10.1016/j.compbiomed.2024.109233
PMID:39362002
Abstract

BACKGROUND

Patient medical information often exists as unstructured text containing abbreviations and acronyms, which save time and space but pose challenges for automated interpretation. Leveraging the efficacy of Transformers in natural language processing, our objective was to take the knowledge acquired by an existing language model and continue its pre-training to develop a European Portuguese (PT-PT) healthcare-domain language model.

METHODS

After a filtering process, Albertina PT-PT 900M was selected as the base language model, and its pre-training was continued on more than 2.6 million electronic medical records from Portugal's largest public hospital. MediAlbertina 900M was created through domain adaptation on this data using masked language modelling.
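
As an illustration of the masked language modelling objective named above, the following is a minimal sketch of how one training example could be corrupted. The abstract does not state the masking scheme, so the 15% selection rate and the 80/10/10 split (mask token, random token, unchanged) are assumptions borrowed from the common BERT-style recipe. The function name `mask_tokens`, the `[MASK]` string, and the whitespace tokenization in the usage note are hypothetical.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption, not taken from the paper)


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for one masked-language-modelling example.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would be computed only on the selected positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # Position selected for prediction: remember the original token.
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with mask
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with random token
            else:
                inputs.append(tok)                # 10%: keep unchanged
        else:
            # Position not selected: input is untouched and carries no label.
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

A call such as `mask_tokens("doente com hta e dm2".split(), vocab=["doente", "com", "hta", "e", "dm2"])` (a made-up sentence, not taken from the study data) returns the corrupted sequence together with the targets the model would be trained to recover.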

RESULTS

The comparison with the baseline used both perplexity, which decreased from about 20 to 1.6, and the fine-tuning and evaluation of information extraction models for Named Entity Recognition and Assertion Status. MediAlbertina PT-PT outperformed Albertina PT-PT in both tasks by 4-6% on recall and F1-score.
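
To make the perplexity figures concrete, here is a small sketch using the standard definition: the exponential of the mean negative log-probability assigned to the evaluated tokens. The abstract does not describe how perplexity was computed for the masked model, so this shows the general formula only. The function name `perplexity` and its input format (a list of natural-log probabilities) are assumptions.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative natural-log probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

Under this definition, a perplexity of 20 matches a model that is on average as uncertain as a uniform choice among 20 tokens, while a value of 1.6 means the correct token receives an average (geometric mean) probability of 1/1.6, about 0.63.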

CONCLUSIONS

This study contributes the first publicly available medical language model trained with PT-PT data. It underscores the efficacy of domain adaptation and helps the scientific community overcome the obstacles faced by non-English languages. By fine-tuning MediAlbertina for PT-PT medical tasks, further steps can be taken to assist physicians, for example by creating decision support systems or building medical timelines in order to perform profiling.
