A textual dataset of de-identified health records in Spanish and Catalan for medical entity recognition and anonymization.

Author information

Lima-López Salvador, Farré-Maduell Eulàlia, Gasco Luis, Rodríguez-Miret Jan, Frid Santiago, Pastor Xavier, Borrat Xavier, Krallinger Martin

Affiliations

NLP for Biomedical Information Analysis Unit, Barcelona Supercomputing Center, Barcelona, 08034, Spain.

Clinical Informatics, Hospital Clinic, Barcelona, 08036, Spain.

Publication information

Sci Data. 2025 Jul 1;12(1):1088. doi: 10.1038/s41597-025-05320-1.

DOI:10.1038/s41597-025-05320-1

PMID:40593799

Abstract

The advancement of clinical natural language processing systems is crucial to exploit the wealth of textual data contained in medical records. Diverse data sources are required in different languages and from different sites to represent global health services. To this end, we have released CARMEN-I, a corpus of anonymized clinical records from the Hospital Clinic of Barcelona written during the COVID-19 pandemic spanning a period of two years. In addition to COVID-19 cases of adult patients, CARMEN-I features multiple comorbidities such as cardiovascular conditions, oncology treatments, post-transplant complications, and infectious diseases. This resource is publicly accessible together with detailed annotation guidelines and granular text-bound annotations generated in a collaborative effort between clinicians, linguists, and engineers to enable training and evaluation of automatic anonymization systems. Moreover, for information extraction purposes, a subset of 500 records is annotated with six relevant clinical concept classes: diseases, symptoms, procedures, medications, pathogens and humans.
