Division of Health and Biomedical Informatics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA.
Division of Cardiology, Department of Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA.
J Am Med Inform Assoc. 2023 Jan 18;30(2):340-347. doi: 10.1093/jamia/ocac225.
Clinical knowledge-enriched transformer models (eg, ClinicalBERT) have achieved state-of-the-art results on clinical natural language processing (NLP) tasks. A core limitation of these transformer models is the substantial memory consumption of their full self-attention mechanism, which leads to performance degradation on long clinical texts. To overcome this, we propose leveraging long-sequence transformer models (eg, Longformer and BigBird), which extend the maximum input sequence length from 512 to 4096 tokens, to enhance the modeling of long-term dependencies in long clinical texts.
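As a minimal sketch (not from the paper) of the sequence-length difference described above: a BERT-style encoder is capped at 512 positions, whereas Longformer replaces full self-attention with sliding-window (sparse) attention and accepts roughly 4096 tokens. The model IDs below are the public base checkpoints, used here only to illustrate the configuration fields.

```python
# Sketch only: compare the position limits of a full-attention encoder and Longformer.
from transformers import AutoConfig

bert_cfg = AutoConfig.from_pretrained("bert-base-uncased")             # full self-attention
long_cfg = AutoConfig.from_pretrained("allenai/longformer-base-4096")  # sliding-window attention

print(bert_cfg.max_position_embeddings)  # 512 positions
print(long_cfg.max_position_embeddings)  # ~4096 positions (plus a few special slots)
print(long_cfg.attention_window)         # per-layer local attention window sizes
```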
Inspired by the success of long-sequence transformer models and the fact that clinical notes are mostly long, we introduce 2 domain-enriched language models, Clinical-Longformer and Clinical-BigBird, which are pretrained on a large-scale clinical corpus. We evaluate both language models on 10 baseline tasks, including named entity recognition, question answering, natural language inference, and document classification.
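As a hedged illustration of how such a pretrained encoder would be fine-tuned on one of these downstream tasks (here, binary document classification), the snippet below loads the publicly released Clinical-Longformer checkpoint named in the availability statement; the input text and label are placeholders, not the paper's benchmark data or training setup.

```python
# Hedged fine-tuning sketch: placeholders stand in for the paper's benchmark datasets.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "yikuan8/Clinical-Longformer"  # released checkpoint (see availability below)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Long notes are truncated at 4096 tokens rather than the 512 allowed by BERT-style models.
batch = tokenizer(["<long clinical note>"], truncation=True, max_length=4096,
                  padding="longest", return_tensors="pt")
loss = model(**batch, labels=torch.tensor([0])).loss
loss.backward()  # plug into any standard optimizer or the Trainer API
```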
The results demonstrate that Clinical-Longformer and Clinical-BigBird consistently and significantly outperform ClinicalBERT and other short-sequence transformers on all 10 downstream tasks, achieving new state-of-the-art results.
Our pretrained language models provide the bedrock for clinical NLP using long texts. We have made our source code available at https://github.com/luoyuanlab/Clinical-Longformer, and the pretrained models are available for public download at https://huggingface.co/yikuan8/Clinical-Longformer.
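For reference, a minimal usage sketch (assuming the Hugging Face transformers library) that loads the released checkpoint from the URL above and encodes a long note; the example sentence is illustrative only.

```python
# Minimal sketch: encode a long clinical note with the released Clinical-Longformer checkpoint.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("yikuan8/Clinical-Longformer")
model = AutoModel.from_pretrained("yikuan8/Clinical-Longformer")

note = "Patient admitted with chest pain; troponin mildly elevated ..."  # illustrative text
inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=4096)
token_embeddings = model(**inputs).last_hidden_state  # one contextual vector per token
```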
This study demonstrates that clinical knowledge-enriched long-sequence transformers are able to learn long-term dependencies in long clinical text. Our methods can also inspire the development of other domain-enriched long-sequence transformers.