Becker Matthias, Krumscheid Mario, Knobelspies Alisa, Seydel Markus, Richter-Pechanski Phillip, Karl Alexander
Department of Computer Science, University of Applied Sciences and Arts Kempten, Bahnhofstr. 61, 87435 Kempten, DE, Germany; Bavarian Center for Digital Health and Social Care, Albert-Einstein-Str. 6, 87437 Kempten, DE, Germany.
Int J Med Inform. 2025 Nov;203:106009. doi: 10.1016/j.ijmedinf.2025.106009. Epub 2025 Jun 6.
Cardiovascular diseases are a major cause of morbidity and mortality, and the management of these conditions generates extensive clinical data. The CARDIO:DE dataset, a German-language corpus of cardiovascular clinical routine letters, has been developed to support natural language processing research. This study seeks to enhance the dataset by introducing refined annotation guidelines and expanding the annotation schema.
The objective of this study was to extend the CARDIO:DE dataset with additional annotation categories and to evaluate state-of-the-art NLP models on it, in order to increase the dataset's utility for clinical applications.
The annotation schema was expanded to cover diagnostic procedures, diagnoses, drugs, medical findings, and therapeutic interventions (labels: Diagnostic, Diagnosis, Drug, Medical_Finding, Therapy). The iterative annotation process involved expert annotators, ensuring high-quality, consistent annotations. Four models (GBERT, medBERT.de, XLM-RoBERTa, and TinyLlama) were fine-tuned and evaluated on the dataset. Model performance was assessed using entity-wise precision, recall, and F1 scores.
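As an illustration of the evaluation described above, the following is a minimal sketch of entity-wise precision, recall, and F1 with a macro average over labels. It assumes BIO-encoded tag sequences and strict span matching (an entity counts as correct only if label and boundaries both match); the abstract does not state the tagging scheme or matching criterion, so both are assumptions, and the function names `extract_entities`, `entity_scores`, and `macro_f1` are hypothetical.

```python
from collections import defaultdict


def extract_entities(tags):
    """Return the set of (label, start, end) spans in a BIO tag sequence."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues_span = tag.startswith("I-") and label == tag[2:]
        if continues_span:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith(("B-", "I-")):
            # An "I-" tag without a matching open span starts a new one (lenient).
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_scores(gold_tags, pred_tags):
    """Per-label precision, recall, and F1 under strict span matching."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in pred:
        counts[span[0]]["tp" if span in gold else "fp"] += 1
    for span in gold - pred:
        counts[span[0]]["fn"] += 1

    scores = {}
    for label, c in counts.items():
        predicted = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        p = c["tp"] / predicted if predicted else 0.0
        r = c["tp"] / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-label F1 scores."""
    return sum(s["f1"] for s in scores.values()) / len(scores) if scores else 0.0


# Toy example with invented tags, not data from CARDIO:DE.
gold = ["B-Drug", "O", "B-Medical_Finding", "I-Medical_Finding", "O", "B-Therapy"]
pred = ["B-Drug", "O", "B-Medical_Finding", "O", "O", "B-Therapy"]
scores = entity_scores(gold, pred)
print(scores)
print(macro_f1(scores))
```

In the toy example the predicted Medical_Finding span has the wrong end boundary, so under strict matching it counts as both a false positive and a false negative for that label. A macro average weights every label equally regardless of how often it occurs, which matters for a schema in which one category dominates the annotations.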
The extended dataset includes 304,582 token-based annotations, with the highest concentration in the Medical_Finding category. Inter-annotator agreement improved over the iterative process, reaching up to 0.98 for certain subsets. Among the evaluated models, TinyLlama performed best in entity recognition, achieving a macro-average F1 score of 0.845, which highlights its potential for clinical NLP tasks.
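The abstract reports inter-annotator agreement without naming the metric. Purely as an illustration, the sketch below computes Cohen's kappa over token-level labels from two annotators; the choice of Cohen's kappa is an assumption and may not be the measure the authors used, and `cohen_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of tokens given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: based on each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy example with invented labels, not data from CARDIO:DE.
annotator_a = ["Drug", "O", "O", "Therapy"]
annotator_b = ["Drug", "O", "Therapy", "Therapy"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

Here the annotators agree on 3 of 4 tokens (0.75 observed agreement), while chance agreement from their label frequencies is 5/16, giving a kappa of 7/11.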
The extended CARDIO:DE dataset, with its refined annotation guidelines, provides a robust foundation for natural language processing applications in the clinical domain. The performance of the TinyLlama model demonstrates the potential of fine-tuning non-domain-specific models for clinical text processing. This work paves the way for more accurate NLP solutions in healthcare, particularly for information extraction and decision support in cardiology.