An analysis of FRE @ BC8 SympTEMIST track: named entity recognition.

Author information

Martinez Ander, García-Santa Nuria

Affiliation

AI & Computing Research Group, Fujitsu Research of Europe Ltd, Camino del Cerro de los Gamos, 1, Pozuelo de Alarcón, Madrid 28224, Spain.

Publication information

Database (Oxford). 2024 Sep 16;2024. doi: 10.1093/database/baae101.

DOI:10.1093/database/baae101

PMID:39283593

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11403810/

Abstract

This paper presents a more in-depth analysis of the approaches used in our submission (Martínez A, García-Santa N. (2023) FRE @ BC8 SympTEMIST track: Named Entity Recognition. Zenodo.) to the 'SympTEMIST' Named Entity Recognition (NER) shared subtask at 'BioCreative 2023'. We participated in the challenge by submitting two systems based on a RoBERTa-architecture LLM trained on Spanish-language clinical data and available in the 'HuggingFace' model repository. Before choosing which systems to submit, we tried different combinations of the techniques described here: Conditional Random Fields and Byte-Pair Encoding dropout. In the second system we also included Sub-Subword feature-based embeddings (SSW). The test set used in the challenge has now been released (López SL, Sánchez LG, Farré E et al. (2024) SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings information extraction. Zenodo), allowing us to analyze our methods in more depth and to measure the impact of introducing data from the CARMEN-I (Lima-López S, Farré-Maduell E, Krallinger M. (2023) CARMEN-I: Clinical Entities Annotation Guidelines in Spanish. Zenodo) corpus. Our experiments show the moderate effect of using the Sub-Subword feature-based embeddings and the impact of including the symptom NER data from the CARMEN-I dataset. Database URL: https://physionet.org/content/carmen-i/1.0/.
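
As an illustration of one technique named in the abstract, the following is a minimal sketch of Byte-Pair Encoding dropout: during tokenization, each applicable merge is skipped with some probability, so the same word can be split into different subword sequences across training passes. This is not the authors' implementation. The function name `bpe_encode`, the toy merge table `MERGES`, and the example word are all assumptions made for illustration only.

```python
import random

# Hypothetical toy merge table, ordered by priority (earlier = applied first).
MERGES = [("d", "o"), ("l", "o"), ("do", "lo"), ("dolo", "r")]


def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with BPE merges, skipping each candidate merge
    with probability `dropout` (0.0 gives ordinary deterministic BPE)."""
    rng = rng or random.Random()
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (ranks[(left, right)], i)
            for i, (left, right) in enumerate(zip(symbols, symbols[1:]))
            if (left, right) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing left to merge at this step
        _, i = min(candidates)  # highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


if __name__ == "__main__":
    rng = random.Random(0)
    print(bpe_encode("dolor", MERGES))                 # no dropout
    print(bpe_encode("dolor", MERGES, 0.5, rng))       # stochastic segmentation
```

With `dropout=0.0` the segmentation is the standard deterministic one; with a value between 0 and 1 the output varies from call to call, which is the source of the regularization effect. Whatever the dropout value, concatenating the returned pieces always reproduces the input word.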
