Si Yuqi, Bernstam Elmer V, Roberts Kirk
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA.
School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, USA; Division of General Internal Medicine, McGovern Medical School, The University of Texas Health Science Center at Houston, TX, USA.
J Biomed Inform. 2021 Apr;116:103726. doi: 10.1016/j.jbi.2021.103726. Epub 2021 Mar 9.
The paradigm of representation learning through transfer learning has the potential to greatly enhance clinical natural language processing. In this work, we propose a multi-task pre-training and fine-tuning approach for learning generalized and transferable patient representations from medical language. The model is first pre-trained on different but related high-prevalence phenotypes and then fine-tuned on downstream target tasks. Our main contribution focuses on the impact this technique can have on low-prevalence phenotypes, a challenging task due to the dearth of data. We validate the representations learned from pre-training and fine-tune the multi-task pre-trained models on low-prevalence phenotypes, including 38 circulatory diseases, 23 respiratory diseases, and 17 genitourinary diseases. We find that multi-task pre-training increases learning efficiency and achieves consistently high performance across the majority of phenotypes. Most importantly, the multi-task pre-trained model is almost always either the best-performing model or performs tolerably close to the best-performing model, a property we refer to as robustness. All these results lead us to conclude that this multi-task transfer learning architecture is a robust approach for developing generalized and transferable patient language representations for numerous phenotypes.
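To make the described workflow concrete, the following is a minimal, hypothetical sketch (in PyTorch) of the multi-task pre-train/fine-tune pattern: a shared encoder with one classification head per high-prevalence phenotype is trained jointly, and the encoder is then reused with a fresh head for a low-prevalence target phenotype. All names (SharedEncoder, MultiTaskModel), layer sizes, the feature representation, and the toy data are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch of multi-task pre-training followed by fine-tuning.
    # Names, dimensions, and data below are illustrative assumptions only.
    import torch
    import torch.nn as nn

    class SharedEncoder(nn.Module):
        """Maps a patient's note features to a shared patient embedding."""
        def __init__(self, input_dim: int, hidden_dim: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    class MultiTaskModel(nn.Module):
        """Shared encoder with one binary head per high-prevalence phenotype."""
        def __init__(self, encoder: SharedEncoder, num_tasks: int, hidden_dim: int = 256):
            super().__init__()
            self.encoder = encoder
            self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(num_tasks)])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            z = self.encoder(x)
            # One logit per pre-training phenotype, stacked as (batch, num_tasks).
            return torch.cat([head(z) for head in self.heads], dim=1)

    # ---- Multi-task pre-training on high-prevalence phenotypes ----
    input_dim, num_tasks = 1000, 10                 # e.g. note-derived features, 10 phenotypes
    encoder = SharedEncoder(input_dim)
    pretrain_model = MultiTaskModel(encoder, num_tasks)
    optimizer = torch.optim.Adam(pretrain_model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    x = torch.randn(32, input_dim)                  # toy batch of patient note features
    y = torch.randint(0, 2, (32, num_tasks)).float()  # multi-label phenotype targets
    for _ in range(5):                              # a few toy epochs
        optimizer.zero_grad()
        loss = loss_fn(pretrain_model(x), y)
        loss.backward()
        optimizer.step()

    # ---- Fine-tuning on a single low-prevalence target phenotype ----
    # Reuse the pre-trained encoder; attach a fresh head for the target task.
    finetune_model = nn.Sequential(encoder, nn.Linear(256, 1))
    ft_optimizer = torch.optim.Adam(finetune_model.parameters(), lr=1e-4)
    x_target = torch.randn(16, input_dim)
    y_target = torch.randint(0, 2, (16, 1)).float()
    ft_optimizer.zero_grad()
    ft_loss = loss_fn(finetune_model(x_target), y_target)
    ft_loss.backward()
    ft_optimizer.step()

In this sketch, transfer happens through the shared encoder weights: the per-phenotype heads used during pre-training are discarded, and only the encoder (plus a new head) is updated on the low-prevalence task, which is one common way to realize the pre-train/fine-tune pattern the abstract describes.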