Automatic extraction of 12 cardiovascular concepts from German discharge letters using pre-trained language models.

Authors

Richter-Pechanski Phillip, Geis Nicolas A, Kiriakou Christina, Schwab Dominic M, Dieterich Christoph

Affiliations

Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, Germany.

Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany.

Publication information

Digit Health. 2021 Nov 26;7:20552076211057662. doi: 10.1177/20552076211057662. eCollection 2021 Jan-Dec.

DOI:10.1177/20552076211057662

PMID:34868618

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8637713/

Abstract

OBJECTIVE

A vast amount of medical data is still stored in unstructured text documents. We present an automated method for extracting information from unstructured German clinical routine data in the cardiology domain, enabling its use in state-of-the-art data-driven deep learning projects.

METHODS

We evaluated pre-trained language models for extracting a set of 12 cardiovascular concepts from German discharge letters. We compared three bidirectional encoder representations from transformers (BERT) models pre-trained on different corpora and fine-tuned them on the task of cardiovascular concept extraction using 204 discharge letters manually annotated by cardiologists at the University Hospital Heidelberg. We compared our results with traditional machine learning methods based on a long short-term memory network and a conditional random field.
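The abstract does not describe how the annotated letters were prepared for fine-tuning. One common preprocessing step for BERT-style token classification is to carry word-level annotations over to subword pieces, and the sketch below illustrates that step only. Everything in it is an assumption for illustration: the toy vocabulary, the example sentence, the placeholder label `CONCEPT`, and the convention of marking continuation pieces with `-100` so they are skipped in the loss. It is not the authors' pipeline.

```python
from typing import Dict, List, Set, Tuple

IGNORE = -100  # assumed marker for positions excluded from the training loss


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            # Continuation pieces carry the "##" prefix, as in WordPiece vocabularies.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def align_labels(
    words: List[str],
    labels: List[str],
    vocab: Set[str],
    label2id: Dict[str, int],
) -> Tuple[List[str], List[int]]:
    """Give each word's label to its first piece; mask the remaining pieces."""
    tokens: List[str] = []
    label_ids: List[int] = []
    for word, label in zip(words, labels):
        pieces = wordpiece(word, vocab)
        tokens.extend(pieces)
        label_ids.append(label2id[label])
        label_ids.extend([IGNORE] * (len(pieces) - 1))
    return tokens, label_ids


# Invented example; "CONCEPT" is a placeholder, not one of the paper's 12 concepts.
vocab = {"Patient", "erhielt", "Meto", "##pro", "##lol"}
words = ["Patient", "erhielt", "Metoprolol"]
labels = ["O", "O", "CONCEPT"]
tokens, label_ids = align_labels(words, labels, vocab, {"O": 0, "CONCEPT": 1})
print(tokens)
print(label_ids)
```

Labeling only the first piece of each word keeps one prediction per annotated word, which fits a token-wise evaluation; labeling every piece is an equally common alternative.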

RESULTS

Our best-performing model, based on a publicly available German pre-trained bidirectional encoder representations from transformers (BERT) model, achieved a token-wise micro-average F1-score of 86% and outperformed the baseline by at least 6%. Moreover, this approach achieved the best trade-off between precision (positive predictive value) and recall (sensitivity).
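As a rough illustration of the reported metric, a token-wise micro-average pools true positives, false positives, and false negatives over all concept classes before computing precision, recall, and F1. The sketch below is one way to compute it and is not the authors' evaluation script. The function name `micro_prf`, the outside label `"O"`, and the placeholder classes `"A"` and `"B"` are assumptions for illustration.

```python
from typing import Dict, Sequence


def micro_prf(
    gold: Sequence[str], pred: Sequence[str], outside: str = "O"
) -> Dict[str, float]:
    """Token-wise micro-averaged precision, recall, and F1 over all concept labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp += 1  # correct concept label on this token
        else:
            if p != outside:
                fp += 1  # predicted a concept that is wrong or absent in gold
            if g != outside:
                fn += 1  # gold concept was missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented toy sequences with placeholder classes "A" and "B".
gold = ["O", "A", "A", "B", "O", "B"]
pred = ["O", "A", "O", "B", "B", "A"]
scores = micro_prf(gold, pred)
print(scores)  # precision, recall, and F1 are all 0.5 for this toy input
```

Because counts are pooled before averaging, frequent concept classes weigh more heavily in the micro-average than rare ones. A token labeled with the wrong concept counts as both a false positive and a false negative.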

CONCLUSION

Our results show that state-of-the-art deep learning methods using pre-trained language models are applicable to cardiovascular concept extraction with limited training data. This minimizes annotation effort, which is currently the bottleneck for applying data-driven deep learning in the clinical domain for German and many other European languages.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7222/8637713/6a9399fea5da/10.1177_20552076211057662-fig1.jpg
