Oversampling effect in pretraining for bidirectional encoder representations from transformers (BERT) to localize medical BERT and enhance biomedical BERT.

Author information

Department of Medical Informatics, Osaka University Graduate School of Medicine, Japan.

Publication information

Artif Intell Med. 2024 Jul;153:102889. doi: 10.1016/j.artmed.2024.102889. Epub 2024 May 5.

Abstract

BACKGROUND

Pretraining large-scale neural language models on raw text has contributed significantly to improving transfer learning in natural language processing. With the introduction of transformer-based language models such as bidirectional encoder representations from transformers (BERT), the performance of information extraction from free text has improved significantly in both the general and medical domains. However, it is difficult to train domain-specific BERT models that perform well in domains for which few large, high-quality databases are publicly available.

OBJECTIVE

We hypothesized that this problem could be addressed by oversampling a domain-specific corpus and using it for pretraining with a larger corpus in a balanced manner. In the present study, we verified our hypothesis by developing pretraining models using our method and evaluating their performance.
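As a rough numerical illustration of what "balanced" pretraining means here (the corpus sizes below are hypothetical and not taken from the study), the oversampling factor for the small domain corpus can be chosen so that its duplicated copies contribute roughly as many tokens as the general corpus:

```python
# Hypothetical corpus sizes, used only to illustrate the balancing arithmetic.
general_tokens = 3_300_000_000  # large general-domain corpus (assumed size)
domain_tokens = 25_000_000      # small domain-specific corpus (assumed size)

# Duplicate the domain corpus enough times that both domains
# contribute a comparable number of tokens during pretraining.
oversampling_factor = round(general_tokens / domain_tokens)
print(oversampling_factor)  # 132 with these assumed numbers
```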

METHODS

Our proposed method was based on the simultaneous pretraining of models with knowledge from distinct domains after oversampling. We conducted three experiments in which we generated (1) an English biomedical BERT from a small biomedical corpus, (2) a Japanese medical BERT from a small medical corpus, and (3) an enhanced biomedical BERT pretrained on complete PubMed abstracts in a balanced manner. We then compared their performance with that of conventional models, as sketched below.
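A minimal sketch of how such a balanced pretraining corpus might be assembled is shown below. It assumes plain-text corpora with one document per line and balances by document count for simplicity; it illustrates the oversampling idea rather than reproducing the authors' actual preprocessing pipeline, and the file names are hypothetical.

```python
import random

def build_balanced_corpus(general_path: str, domain_path: str,
                          output_path: str, seed: int = 42) -> None:
    """Oversample a small domain corpus so it roughly balances a larger general
    corpus, then shuffle the combined documents into one pretraining file."""
    with open(general_path, encoding="utf-8") as f:
        general_docs = [line.strip() for line in f if line.strip()]
    with open(domain_path, encoding="utf-8") as f:
        domain_docs = [line.strip() for line in f if line.strip()]

    # Repeat the domain documents until their count is comparable to the general corpus.
    factor = max(1, round(len(general_docs) / max(1, len(domain_docs))))
    combined = general_docs + domain_docs * factor

    # Shuffle so that general and domain documents are interleaved during pretraining.
    random.Random(seed).shuffle(combined)
    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(combined) + "\n")

# Hypothetical usage:
# build_balanced_corpus("general_corpus.txt", "medical_corpus.txt",
#                       "balanced_pretraining_corpus.txt")
```

The resulting balanced file could then be fed to a standard BERT pretraining pipeline (masked language modeling and, depending on the implementation, next-sentence prediction).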

RESULTS

Our English BERT pretrained on both a general-domain corpus and a small medical-domain corpus performed well enough for practical use on the biomedical language understanding evaluation (BLUE) benchmark. Moreover, our proposed method was more effective than the conventional methods for each biomedical corpus of the same size in the general domain. Our Japanese medical BERT outperformed the other BERT models built using a conventional method on almost all the medical tasks, demonstrating the same trend as in the first (English) experiment. Further, our enhanced biomedical BERT model, which was not pretrained on clinical notes, achieved superior clinical and biomedical scores on the BLUE benchmark, with increases of 0.3 points in the clinical score and 0.5 points in the biomedical score over the models trained without our proposed method.

CONCLUSIONS

Well-balanced pretraining using oversampled instances derived from a corpus appropriate for the target task allowed us to construct a high-performance BERT model.

