Si Yuqi, Bernstam Elmer V, Roberts Kirk
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA.
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA; Division of General Internal Medicine, McGovern Medical School, The University of Texas Health Science Center at Houston, TX, USA.
J Biomed Inform. 2021 Apr;116:103726. doi: 10.1016/j.jbi.2021.103726. Epub 2021 Mar 9.
The paradigm of representation learning through transfer learning has the potential to greatly enhance clinical natural language processing. In this work, we propose a multi-task pre-training and fine-tuning approach for learning generalized and transferable patient representations from medical language. The model is first pre-trained on different but related high-prevalence phenotypes and then fine-tuned on downstream target tasks. Our main contribution focuses on the impact this technique can have on low-prevalence phenotypes, a challenging task due to the dearth of data. We validate the representations learned from pre-training and fine-tune the multi-task pre-trained models on low-prevalence phenotypes, including 38 circulatory diseases, 23 respiratory diseases, and 17 genitourinary diseases. We find that multi-task pre-training increases learning efficiency and achieves consistently high performance across the majority of phenotypes. Most importantly, the multi-task pre-trained model is almost always either the best-performing model or performs tolerably close to the best-performing model, a property we refer to as robustness. All these results lead us to conclude that this multi-task transfer learning architecture is a robust approach for developing generalized and transferable patient language representations for numerous phenotypes.
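To make the described workflow concrete, the following is a minimal, hypothetical sketch (in PyTorch) of the multi-task pre-train/fine-tune pattern: a shared encoder with one classification head per high-prevalence phenotype is trained jointly, and the encoder is then reused with a fresh head for a low-prevalence target phenotype. All names (SharedEncoder, MultiTaskModel), layer sizes, the feature representation, and the toy data are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch of multi-task pre-training followed by fine-tuning.
    # Names, dimensions, and data below are illustrative assumptions only.
    import torch
    import torch.nn as nn

    class SharedEncoder(nn.Module):
        """Maps a patient's note features to a shared patient embedding."""
        def __init__(self, input_dim: int, hidden_dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    class MultiTaskModel(nn.Module):
        """Shared encoder with one binary head per high-prevalence phenotype."""
        def __init__(self, encoder: SharedEncoder, num_tasks: int, hidden_dim: int = 256):
            super().__init__()
            self.encoder = encoder
            self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(num_tasks)])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            z = self.encoder(x)
            # One logit per pre-training phenotype, stacked as (batch, num_tasks).
            return torch.cat([head(z) for head in self.heads], dim=1)

    # ---- Multi-task pre-training on high-prevalence phenotypes ----
    input_dim, num_tasks = 1000, 10                 # e.g. note-derived features, 10 phenotypes
    encoder = SharedEncoder(input_dim)
    pretrain_model = MultiTaskModel(encoder, num_tasks)
    optimizer = torch.optim.Adam(pretrain_model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    x = torch.randn(32, input_dim)                  # toy batch of patient note features
    y = torch.randint(0, 2, (32, num_tasks)).float()  # multi-label phenotype targets
    for _ in range(5):                              # a few toy epochs
        optimizer.zero_grad()
        loss = loss_fn(pretrain_model(x), y)
        loss.backward()
        optimizer.step()

    # ---- Fine-tuning on a single low-prevalence target phenotype ----
    # Reuse the pre-trained encoder; attach a fresh head for the target task.
    finetune_model = nn.Sequential(encoder, nn.Linear(256, 1))
    ft_optimizer = torch.optim.Adam(finetune_model.parameters(), lr=1e-4)
    x_target = torch.randn(16, input_dim)
    y_target = torch.randint(0, 2, (16, 1)).float()
    ft_optimizer.zero_grad()
    ft_loss = loss_fn(finetune_model(x_target), y_target)
    ft_loss.backward()
    ft_optimizer.step()

In this sketch, transfer happens through the shared encoder weights: the per-phenotype heads used during pre-training are discarded, and only the encoder (plus a new head) is updated on the low-prevalence task, which is one common way to realize the pre-train/fine-tune pattern the abstract describes.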