A comparative study of pretrained language models for long clinical text.

Affiliations

Division of Health and Biomedical Informatics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA.

Division of Cardiology, Department of Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA.

Publication Information

J Am Med Inform Assoc. 2023 Jan 18;30(2):340-347. doi: 10.1093/jamia/ocac225.

Abstract

OBJECTIVE

Clinical knowledge-enriched transformer models (eg, ClinicalBERT) have achieved state-of-the-art results on clinical natural language processing (NLP) tasks. A core limitation of these transformer models is the substantial memory consumption of their full self-attention mechanism, which leads to performance degradation on long clinical texts. To overcome this, we propose to leverage long-sequence transformer models (eg, Longformer and BigBird), which extend the maximum input sequence length from 512 to 4096 tokens, to enhance the ability to model long-term dependencies in long clinical texts.
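The memory argument above can be made concrete with a back-of-the-envelope count of attention scores. The sketch below is illustrative only: it assumes a sliding-window pattern with a window of 512 neighbors per token (Longformer's default window width), and ignores the global-attention tokens and random blocks that Longformer and BigBird add on top.

```python
# Full self-attention scores every token pair (O(n^2)); sliding-window
# attention scores only a fixed neighborhood per token (O(n * w)).
# w = 512 is an assumed window width for illustration, not a benchmark.

def full_attention_pairs(n: int) -> int:
    """Number of query-key scores with full self-attention."""
    return n * n

def sliding_window_pairs(n: int, w: int = 512) -> int:
    """Approximate scores when each token attends to at most w neighbors."""
    return n * min(n, w)

for n in (512, 4096):
    full = full_attention_pairs(n)
    sparse = sliding_window_pairs(n)
    print(f"n={n}: full={full:,} sliding={sparse:,} ratio={full / sparse:.1f}x")
```

At the short-sequence limit (n = 512) the two patterns cost the same, but at n = 4096 full self-attention computes 8x as many scores, which is why the sparse patterns make the longer input length tractable.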

MATERIALS AND METHODS

Inspired by the success of long-sequence transformer models and the fact that clinical notes are mostly long, we introduce 2 domain-enriched language models, Clinical-Longformer and Clinical-BigBird, which are pretrained on a large-scale clinical corpus. We evaluate both language models on 10 benchmark tasks, including named entity recognition, question answering, natural language inference, and document classification.

RESULTS

The results demonstrate that Clinical-Longformer and Clinical-BigBird consistently and significantly outperform ClinicalBERT and other short-sequence transformers in all 10 downstream tasks and achieve new state-of-the-art results.

DISCUSSION

Our pretrained language models provide the bedrock for clinical NLP using long texts. Our source code is available at https://github.com/luoyuanlab/Clinical-Longformer, and the pretrained models are available for public download at https://huggingface.co/yikuan8/Clinical-Longformer.

CONCLUSION

This study demonstrates that clinical knowledge-enriched long-sequence transformers are able to learn long-term dependencies in long clinical text. Our methods can also inspire the development of other domain-enriched long-sequence transformers.


Similar Articles

Clinical concept extraction using transformers.
J Am Med Inform Assoc. 2020 Dec 9;27(12):1935-1942. doi: 10.1093/jamia/ocaa189.

KEBLM: Knowledge-Enhanced Biomedical Language Models.
J Biomed Inform. 2023 Jul;143:104392. doi: 10.1016/j.jbi.2023.104392. Epub 2023 May 19.

Cited By

Synthetic4Health: generating annotated synthetic clinical letters.
Front Digit Health. 2025 May 30;7:1497130. doi: 10.3389/fdgth.2025.1497130. eCollection 2025.

