Transformer-based approach for symptom recognition and multilingual linking.

Author information

Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Blvd "James Bourchier" 5, Sofia 1164, Bulgaria.

Ontotext, ul. "Nikola Gabrovski" 79, Sofia 1700, Bulgaria.

Publication information

Database (Oxford). 2024 Sep 10;2024. doi: 10.1093/database/baae090.

DOI:10.1093/database/baae090

PMID:39259689

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11389607/

Abstract

This paper presents a transformer-based approach to symptom named entity recognition (NER) in Spanish clinical texts and to multilingual entity linking on the SympTEMIST dataset. For Spanish NER, we fine-tune a RoBERTa-based token-level classifier with bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) layers on an augmented training set, achieving an F1 score of 0.73. Entity linking uses a hybrid approach that combines dictionaries with candidate generation from a knowledge base of Unified Medical Language System (UMLS) aliases using the cross-lingual SapBERT, followed by reranking of the top candidates with GPT-3.5. The entity linking approach gives consistent results across languages, with an accuracy of 0.73 on the SympTEMIST multilingual dataset, and achieves an accuracy of 0.6123 on the Spanish entity linking task, surpassing the current top score for this subtask. Database URL: https://github.com/svassileva/symptemist-multilingual-linking.
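The hybrid linking idea in the abstract (dictionary, then candidate generation, then reranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the knowledge base entries and concept codes are invented placeholders, the character-trigram `embed` function is a stand-in for SapBERT embeddings, the `rerank` hook is a stand-in for the GPT-3.5 step, and trying the dictionary before the embedding search is an assumed ordering.

```python
import math
from collections import Counter

# Toy knowledge base mapping alias -> concept code.
# ASSUMPTION: aliases and codes are invented placeholders, not real UMLS entries.
KB = {
    "fever": "C001",
    "pyrexia": "C001",
    "headache": "C002",
    "cephalalgia": "C002",
    "abdominal pain": "C003",
}


def embed(text, n=3):
    """Stand-in encoder: a bag of character trigrams.

    ASSUMPTION: a real system would use dense SapBERT embeddings here.
    """
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_candidates(mention, kb, top_k=3):
    """Return the top_k (score, alias, code) tuples most similar to the mention."""
    query = embed(mention)
    scored = [(cosine(query, embed(alias)), alias, code) for alias, code in kb.items()]
    return sorted(scored, reverse=True)[:top_k]


def link(mention, kb, rerank=None, top_k=3):
    """Link a mention to a concept code.

    1. Exact dictionary match on the normalised mention.
    2. Otherwise, nearest-neighbour candidate generation.
    3. Optional reranking via a caller-supplied function
       (hypothetical hook standing in for an LLM reranker).
    """
    key = mention.lower().strip()
    if key in kb:
        return kb[key]
    candidates = generate_candidates(key, kb, top_k)
    if rerank is not None:
        return rerank(mention, candidates)
    return candidates[0][2]


print(link("Fever", KB))      # exact dictionary match
print(link("headaches", KB))  # falls back to nearest-neighbour search
```

In this sketch the reranker is only a callable that receives the mention and the candidate list, so any selection strategy can be plugged in without changing the candidate generation step.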
