Richter-Pechanski Phillip, Geis Nicolas A, Kiriakou Christina, Schwab Dominic M, Dieterich Christoph
Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, Germany.
Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany.
Digit Health. 2021 Nov 26;7:20552076211057662. doi: 10.1177/20552076211057662. eCollection 2021 Jan-Dec.
A vast amount of medical data is still stored in unstructured text documents. We present an automated method for extracting information from unstructured German clinical routine data in the cardiology domain, enabling its use in state-of-the-art, data-driven deep learning projects.
We evaluated pre-trained language models for extracting a set of 12 cardiovascular concepts from German discharge letters. We compared three bidirectional encoder representations from transformers (BERT) models, each pre-trained on a different corpus, and fine-tuned them on the task of cardiovascular concept extraction using 204 discharge letters manually annotated by cardiologists at the University Hospital Heidelberg. We compared our results with traditional machine learning methods based on a long short-term memory network and a conditional random field.
Our best-performing model, based on a publicly available German pre-trained BERT model, achieved a token-wise micro-average F1-score of 86% and outperformed the baseline by at least 6%. Moreover, this approach achieved the best trade-off between precision (positive predictive value) and recall (sensitivity).
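As an illustration of the reported metric, the following is a minimal sketch of how a token-wise micro-average F1-score can be computed from gold and predicted token labels. It is not the authors' evaluation code. The function name `micro_f1`, the outside label `"O"`, its exclusion from the counts, and the example labels `Medication` and `Diagnosis` are all assumptions made for this example.

```python
def micro_f1(gold, pred, outside="O"):
    """Token-wise micro-averaged precision, recall and F1.

    Counts are pooled over all concept labels. Tokens carrying the
    outside label (assumed here to be "O") are not counted as a concept.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp += 1          # correct concept label
        else:
            if p != outside:
                fp += 1      # predicted a concept that is wrong or spurious
            if g != outside:
                fn += 1      # missed or mislabeled a gold concept

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for a six-token snippet (not taken from the study data).
gold = ["O", "Medication", "Medication", "O", "Diagnosis", "O"]
pred = ["O", "Medication", "O", "Diagnosis", "Diagnosis", "O"]
precision, recall, f1 = micro_f1(gold, pred)
```

Because the counts are pooled before the ratio is taken, frequent concepts weigh more heavily in a micro-average than rare ones, in contrast to a macro-average over per-concept scores.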
Our results show that state-of-the-art deep learning methods using pre-trained language models are applicable to cardiovascular concept extraction with limited training data. This minimizes annotation effort, which is currently the bottleneck for applying data-driven deep learning in the clinical domain for German and many other European languages.