Extending CARDIO:DE: Additional annotation guidelines and evaluation of NLP approaches for clinical applications.

Author information

Becker Matthias, Krumscheid Mario, Knobelspies Alisa, Seydel Markus, Richter-Pechanski Phillip, Karl Alexander

Affiliations

Department of Computer Science, University of Applied Sciences and Arts Kempten, Bahnhofstr. 61, 87435 Kempten, DE, Germany; Bavarian Center for Digital Health and Social Care, Albert-Einstein-Str. 6, 87437 Kempten, DE, Germany.

Publication information

Int J Med Inform. 2025 Nov;203:106009. doi: 10.1016/j.ijmedinf.2025.106009. Epub 2025 Jun 6.

DOI: 10.1016/j.ijmedinf.2025.106009
PMID: 40513382

Abstract

BACKGROUND

Cardiovascular diseases are a major cause of morbidity and mortality, and the management of these conditions generates extensive clinical data. The CARDIO:DE dataset, a German-language corpus of cardiovascular clinical routine letters, has been developed to support natural language processing research. This study seeks to enhance the dataset by introducing refined annotation guidelines and expanding the annotation schema.

OBJECTIVE

The objective of this study was to extend the CARDIO:DE dataset with additional annotation categories and to evaluate state-of-the-art NLP models, with the aim of making the dataset more useful for clinical applications.

METHODS

The annotation schema was expanded to include categories such as diagnostic procedures, medical findings, and therapeutic interventions (Diagnostic, Diagnosis, Drug, Medical_Finding, Therapy). Annotation was carried out iteratively by expert annotators to keep the annotations consistent and of high quality. Four models (GBERT, medBERT.de, XLM-RoBERTa, and TinyLlama) were fine-tuned and evaluated on the dataset. Model performance was assessed using entity-wise precision, recall, and F1 scores.
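The entity-wise metrics named above can be illustrated with a short sketch. The abstract does not state the tagging scheme or the matching criterion, so the following assumes BIO tags and exact span matching; the function names and the toy tag sequences are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def extract_entities(tags):
    """Return a set of (label, start, end) spans from a BIO tag sequence.

    Assumption: BIO tagging. An I- tag that does not continue the current
    span is treated leniently as the start of a new span.
    """
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((label, start, i))
                start, label = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
    return spans


def entity_scores(gold_tags, pred_tags):
    """Per-label precision, recall, and F1 with exact span matching."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in pred:
        counts[span[0]]["tp" if span in gold else "fp"] += 1
    for span in gold - pred:
        counts[span[0]]["fn"] += 1

    scores = {}
    for label, c in counts.items():
        predicted = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        p = c["tp"] / predicted if predicted else 0.0
        r = c["tp"] / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-label F1 scores."""
    return sum(s["f1"] for s in scores.values()) / len(scores) if scores else 0.0


# Toy example (invented data, using label names from the abstract).
gold = ["B-Drug", "I-Drug", "O", "B-Diagnosis", "O", "B-Therapy"]
pred = ["B-Drug", "I-Drug", "O", "B-Diagnosis", "I-Diagnosis", "O"]
scores = entity_scores(gold, pred)
print(scores["Drug"], macro_f1(scores))
```

In this toy case only the Drug span matches exactly, so Drug scores 1.0 while Diagnosis (wrong boundary) and Therapy (missed) score 0.0, and the macro average is one third. A partial-overlap criterion would give different numbers; which criterion the authors used is not stated in the abstract.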

RESULTS

The extended dataset includes 304,582 token-based annotations, with the highest concentration in the Medical_Finding category. Inter-annotator agreement improved over the iterative process, reaching up to 0.98 for certain subsets. Among the evaluated models, TinyLlama performed best in entity recognition, achieving a macro-average F1 score of 0.845, which highlights its potential for clinical NLP tasks.
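The abstract does not name the agreement statistic behind the reported inter-annotator scores. As a purely illustrative sketch, the following assumes Cohen's kappa computed over token-level labels from two annotators; the function name and the toy label sequences are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens.

    Assumption: agreement is measured per token; the paper may use a
    different statistic or unit of comparison.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (invented data): the annotators disagree on one of eight tokens.
annotator_a = ["Drug", "O", "O", "Diagnosis", "O", "Therapy", "O", "O"]
annotator_b = ["Drug", "O", "O", "Diagnosis", "O", "O", "O", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(kappa)
```

Here the observed agreement is 7/8 and the chance agreement is 0.5, so kappa is 0.75. Because kappa corrects for chance, it is lower than raw agreement whenever one label, such as "O", dominates the token stream.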

CONCLUSIONS

The extended CARDIO:DE dataset, with its refined annotation guidelines, provides a robust foundation for natural language processing applications in the clinical domain. The performance of the TinyLlama model demonstrates the potential of fine-tuning non-domain-specific models for clinical text processing. This work paves the way for more accurate NLP solutions in healthcare, particularly for information extraction and decision support in cardiology.

Similar articles

1
Toward Cross-Hospital Deployment of Natural Language Processing Systems: Model Development and Validation of Fine-Tuned Large Language Models for Disease Name Recognition in Japanese.
JMIR Med Inform. 2025 Jul 8;13:e76773. doi: 10.2196/76773.
2
Extraction of sleep information from clinical notes of Alzheimer's disease patients using natural language processing.
J Am Med Inform Assoc. 2024 Oct 1;31(10):2217-2227. doi: 10.1093/jamia/ocae177.
3
Identify diabetic retinopathy-related clinical concepts and their attributes using transformer-based natural language processing methods.
BMC Med Inform Decis Mak. 2022 Sep 27;22(Suppl 3):255. doi: 10.1186/s12911-022-01996-2.
4
Multicriteria Optimization of Language Models for Heart Failure With Preserved Ejection Fraction Symptom Detection in Spanish Electronic Health Records: Comparative Modeling Study.
J Med Internet Res. 2025 Jul 17;27:e76433. doi: 10.2196/76433.
5
Harnessing Moderate-Sized Language Models for Reliable Patient Data Deidentification in Emergency Department Records: Algorithm Development, Validation, and Implementation Study.
JMIR AI. 2025 Apr 1;4:e57828. doi: 10.2196/57828.
6
Developing an ICD-10 Coding Assistant: Pilot Study Using RoBERTa and GPT-4 for Term Extraction and Description-Based Code Selection.
JMIR Form Res. 2025 Feb 11;9:e60095. doi: 10.2196/60095.
7
Cross-institutional dental electronic health record entity extraction via generative artificial intelligence and synthetic notes.
JAMIA Open. 2025 Jun 28;8(3):ooaf061. doi: 10.1093/jamiaopen/ooaf061. eCollection 2025 Jun.
8
De-identification of clinical free text using natural language processing: A systematic review of current approaches.
Artif Intell Med. 2024 May;151:102845. doi: 10.1016/j.artmed.2024.102845. Epub 2024 Mar 20.
9
Development of a Natural Language Processing Model for Extracting Kidney Biopsy Pathology Diagnoses.
Kidney Med. 2025 Jun 14;7(8):101047. doi: 10.1016/j.xkme.2025.101047. eCollection 2025 Aug.