González-Moreno Ana, Ramos-González Alberto, González-Carrasco Israel, Alonso Díaz de Durana M Dolores, Sellers Gutiérrez-Argumosa Beatriz, Moncada Salinero Alicia, Pastor-Magro Ana Belén, González-Piñeiro Beatriz, Tejedor-Alonso Miguel A, Martínez Paloma
Allergy Unit, Hospital Universitario Fundación Alcorcón, C. Budapest, 1, Alcorcón, 28922, Madrid, Spain.
Computer Science and Engineering Department, Universidad Carlos III de Madrid, Av. Universidad, 30, Leganés, 28911, Madrid, Spain.
Sci Data. 2025 Jan 29;12(1):173. doi: 10.1038/s41597-025-04503-0.
This article describes a dataset on nut allergy extracted from Spanish clinical records provided by the Hospital Universitario Fundación Alcorcón (HUFA) in Madrid, Spain, in collaboration with its Allergology Unit and its Information Systems and Technologies Department. Few clinical texts in Spanish are publicly available, and more are needed to train and test information extraction systems. The corpus comprises 828 clinical notes in Spanish. Several experts took part in the annotation process, assigning the annotated entities to medical semantic groups related to allergies. To evaluate inter-annotator agreement, 8% of the texts were annotated in triplicate. The guidelines followed to create the corpus are also provided. To validate the corpus and present a real use case, we ran supervised named entity recognition (NER) experiments with this resource by fine-tuning encoder-based transformers. These experiments achieved an average F-measure of 86.2%. These results indicate that the corpus is suitable for training and testing NER approaches in the field of allergology.
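For readers unfamiliar with how an F-measure is usually obtained in NER, the following is a minimal sketch of entity-level scoring, in which a predicted entity counts as correct only if its span and label both match the gold annotation exactly. The abstract does not state the tagging scheme, the matching criterion, or how the 86.2% average was computed, so the BIO tags, the `FOOD` and `SYMPTOM` labels, and the function names below are illustrative assumptions rather than the authors' evaluation code.

```python
from typing import List, Set, Tuple

# (start_token, end_token_exclusive, label); labels are hypothetical.
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end, label) spans from a BIO tag sequence.

    An I- tag that does not continue a span of the same label is ignored.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def f_measure(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-label matching."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy five-token sentence: the prediction misses the second entity.
    gold_tags = ["O", "B-FOOD", "I-FOOD", "O", "B-SYMPTOM"]
    pred_tags = ["O", "B-FOOD", "I-FOOD", "O", "O"]
    p, r, f1 = f_measure(bio_to_spans(gold_tags), bio_to_spans(pred_tags))
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In the toy sentence, one of the two gold entities is recovered and nothing spurious is predicted, so precision is 1.0, recall is 0.5, and F1 is about 0.667. An average F-measure over several models or entity types would be taken over scores of this kind, though the exact averaging used in the article is not specified here.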