Lima-López Salvador, Farré-Maduell Eulàlia, Gasco Luis, Rodríguez-Miret Jan, Frid Santiago, Pastor Xavier, Borrat Xavier, Krallinger Martin
NLP for Biomedical Information Analysis Unit, Barcelona Supercomputing Center, Barcelona, 08034, Spain.
Clinical Informatics, Hospital Clinic, Barcelona, 08036, Spain.
Sci Data. 2025 Jul 1;12(1):1088. doi: 10.1038/s41597-025-05320-1.
The advancement of clinical natural language processing systems is crucial to exploit the wealth of textual data contained in medical records. Representing global health services requires diverse data sources in different languages and from different sites. To this end, we have released CARMEN-I, a corpus of anonymized clinical records from the Hospital Clinic of Barcelona written over a two-year period during the COVID-19 pandemic. In addition to COVID-19 cases of adult patients, CARMEN-I features multiple comorbidities such as cardiovascular conditions, oncology treatments, post-transplant complications, and infectious diseases. This resource is publicly accessible together with detailed annotation guidelines and granular text-bound annotations generated in a collaborative effort between clinicians, linguists, and engineers to enable training and evaluation of automatic anonymization systems. Moreover, for information extraction purposes, a subset of 500 records is annotated with six relevant clinical concept classes: diseases, symptoms, procedures, medications, pathogens, and humans.