Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes.

Affiliation

IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal.

Publication information

Database (Oxford). 2024 Jul 30;2024. doi: 10.1093/database/baae068.

Abstract

Identifying medical concepts in clinical narratives is of great interest to the biomedical scientific community because of its importance for improving treatments and for drug development research. Biomedical named entity recognition (NER) in clinical texts is crucial for automated information extraction, facilitating patient record analysis, drug development, and medical research. Traditional approaches often focus on single-class NER tasks, yet recent advances emphasize the need to address multi-class scenarios, particularly in complex biomedical domains. This paper proposes a strategy to integrate a multi-head conditional random field (CRF) classifier for multi-class NER in Spanish clinical documents. By using a multi-head CRF model, our method handles overlapping entity instances of different types, a common challenge for traditional NER methods. This architecture enhances computational efficiency and ensures scalability for multi-class NER tasks while maintaining high performance. By combining four diverse datasets, SympTEMIST, MedProcNER, DisTEMIST, and PharmaCoNER, we expand the scope of NER to encompass five classes: symptoms, procedures, diseases, chemicals, and proteins. To the best of our knowledge, these datasets combined create the largest Spanish multi-class dataset focusing on biomedical entity recognition and linking for clinical notes, which is important for training biomedical models in Spanish. We also provide entity linking to the multi-lingual Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, with the eventual goal of performing biomedical relation extraction. In experiments on Spanish clinical documents, our strategy achieves results competitive with single-class NER models. For NER, our system achieves a combined micro-averaged F1-score of 78.73; clinical mentions are normalized to SNOMED CT with an end-to-end F1-score of 54.51.
The code to run our system is publicly available at https://github.com/ieeta-pt/Multi-Head-CRF. Database URL: https://github.com/ieeta-pt/Multi-Head-CRF.
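
To illustrate why one head per entity class can represent overlapping mentions of different types, and how a combined micro-averaged F1 pools counts across classes, here is a minimal Python sketch. It is not the authors' implementation (see the repository above for that). It assumes each head emits its own BIO tag sequence over the same tokens, which the abstract does not state; the function names and the toy sentence are invented for illustration, and no CRF is trained here.

```python
from typing import Dict, List, Set, Tuple

# (entity class, start token index, end token index, exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(label: str, tags: List[str]) -> Set[Span]:
    """Decode one head's BIO tags into spans for that head's class."""
    spans: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        if start is not None and tag != "I":
            spans.add((label, start, i))
            start = None
        if tag == "B":
            start = i
        # A stray "I" with no open span is ignored in this sketch.
    return spans


def decode_heads(head_tags: Dict[str, List[str]]) -> Set[Span]:
    """Union the spans of all heads; heads are decoded independently,
    so spans of different classes may overlap."""
    spans: Set[Span] = set()
    for label, tags in head_tags.items():
        spans |= bio_to_spans(label, tags)
    return spans


def micro_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all classes, exact span match."""
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy sentence (not from any of the datasets):
tokens = ["biopsia", "de", "tumor", "hepático"]
predicted = decode_heads({
    "procedure": ["B", "I", "I", "I"],  # "biopsia de tumor hepático"
    "disease":   ["O", "O", "B", "I"],  # "tumor hepático", nested in the procedure
})
gold = {("procedure", 0, 4), ("disease", 3, 4)}
score = micro_f1(gold, predicted)
```

A single tag sequence per token could not mark "tumor" as both inside a procedure and the start of a disease; separate per-class sequences can. The function returns a fraction in [0, 1], whereas the abstract reports F1 on a 0 to 100 scale.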
