Lentzen Manuel, Madan Sumit, Lage-Rupprecht Vanessa, Kühnel Lisa, Fluck Juliane, Jacobs Marc, Mittermaier Mirja, Witzenrath Martin, Brunecker Peter, Hofmann-Apitius Martin, Weber Joachim, Fröhlich Holger
Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, Germany.
Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, Germany.
JAMIA Open. 2022 Nov 15;5(4):ooac087. doi: 10.1093/jamiaopen/ooac087. eCollection 2022 Dec.
Healthcare data such as clinical notes are primarily recorded in an unstructured manner. If adequately translated into structured data, they can be utilized for health economics and lay the groundwork for better individualized patient care. For structuring clinical notes, deep-learning methods, particularly transformer-based models such as BERT, have recently received much attention. Currently, biomedical applications of such models are primarily focused on the English language. While general-purpose German-language models such as GermanBERT and GottBERT have been published, adaptations for biomedical data are unavailable. This study evaluated the suitability of existing and novel transformer-based models for the German biomedical and clinical domain.
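As a concrete illustration of how such a general-purpose German transformer is queried, the following minimal Python sketch runs masked-token prediction with the Hugging Face transformers library. The checkpoint ID "bert-base-german-cased" and the example sentence are illustrative assumptions, not artifacts of this study.

```python
# Minimal sketch: masked-token prediction with a general-purpose
# German transformer via the Hugging Face "transformers" library.
# The model ID "bert-base-german-cased" is an assumption; the paper's
# GermanBERT/GottBERT checkpoints may be published under other IDs.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-german-cased")

# Hypothetical clinical-style sentence with one masked token.
for prediction in fill_mask("Der Patient klagt über starke [MASK]."):
    print(f"{prediction['token_str']:>15}  score={prediction['score']:.3f}")
```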
We used 8 existing transformer-based models, pre-trained 3 new models on a newly generated biomedical corpus, and systematically compared all of them with each other. We annotated a new dataset of clinical notes and used it together with 4 other corpora (BRONCO150, CLEF eHealth 2019 Task 1, GGPONC, and JSynCC) to perform named entity recognition (NER) and document classification tasks.
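The NER evaluations amount to fine-tuning each checkpoint for token classification. The sketch below shows this setup under stated assumptions: the tag set, model ID, and the single annotated example are hypothetical stand-ins for the real corpus annotations (e.g., from BRONCO150).

```python
# Minimal sketch: fine-tuning a German transformer for NER
# (token classification) with Hugging Face "transformers".
# The label set, model ID, and example are illustrative assumptions,
# not the paper's exact configuration.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-DIAGNOSIS", "I-DIAGNOSIS"]  # hypothetical tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-german-cased", num_labels=len(labels))

# One annotated example: pre-tokenized words with word-level tags.
words = ["Patient", "mit", "akuter", "Bronchitis"]
word_tags = [0, 0, 1, 2]  # indices into `labels`

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to subword tokens; special tokens get -100
# so they are ignored by the cross-entropy loss.
aligned = [word_tags[i] if i is not None else -100
           for i in enc.word_ids(batch_index=0)]
outputs = model(**enc, labels=torch.tensor([aligned]))
outputs.loss.backward()  # an optimizer step would follow in training
print(f"loss = {outputs.loss.item():.3f}")
```

The label alignment step matters because subword tokenization splits words into multiple tokens, while the gold annotations are word-level.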
General-purpose language models can be used effectively for biomedical and clinical natural language processing (NLP) tasks; still, our newly trained BioGottBERT model outperformed GottBERT on both clinical NER tasks. Training new biomedical models from scratch, however, proved ineffective.
The domain-adaptation strategy's potential is currently limited due to a lack of pre-training data. Since general-purpose language models are only marginally inferior to domain-specific models, both options are suitable for developing German-language biomedical applications.
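For reference, the domain-adaptation strategy discussed here corresponds to continued masked language model (MLM) pre-training of a general-purpose checkpoint on in-domain text. The following sketch outlines that procedure; the hub ID "uklfr/gottbert-base", the corpus file name, and all hyperparameters are assumptions for illustration, not the paper's training setup.

```python
# Minimal sketch: domain-adaptive pre-training via continued MLM
# training on in-domain text. The checkpoint ID "uklfr/gottbert-base",
# the file "biomedical_de.txt", and the hyperparameters are assumed
# placeholders, not the configuration used for BioGottBERT.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("uklfr/gottbert-base")
model = AutoModelForMaskedLM.from_pretrained("uklfr/gottbert-base")

# Hypothetical plain-text German biomedical corpus, one document per line.
corpus = load_dataset("text", data_files={"train": "biomedical_de.txt"})
tokenized = corpus["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biogottbert-mlm",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized,
    # Randomly masks 15% of tokens for the MLM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()  # the adapted checkpoint is then fine-tuned, e.g., on NER
```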
General-purpose language models perform remarkably well on biomedical and clinical NLP tasks. If larger corpora become available in the future, domain-adapting these models may improve performance.