Bioformer: an efficient transformer language model for biomedical text mining.

Author information

Li Fang, Qingyu Chen, Chih-Hsuan Wei, Zhiyong Lu, Kai Wang

Affiliations

Department of Genetics and Biomedical Informatics, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, 510080, China.

Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA.

Publication information

ArXiv. 2023 Feb 3:arXiv:2302.01588v1.

Abstract

Pretrained language models such as Bidirectional Encoder Representations from Transformers (BERT) have achieved state-of-the-art performance in natural language processing (NLP) tasks. Recently, BERT has been adapted to the biomedical domain. Despite their effectiveness, these models have hundreds of millions of parameters and are computationally expensive when applied to large-scale NLP applications. We hypothesized that the number of parameters of the original BERT can be dramatically reduced with minor impact on performance. In this study, we present Bioformer, a compact BERT model for biomedical text mining. We pretrained two Bioformer models (named Bioformer8L and Bioformer16L) which reduced the model size by 60% compared to BERT-base. Bioformer uses a biomedical vocabulary and was pre-trained from scratch on PubMed abstracts and PubMed Central full-text articles. We thoroughly evaluated the performance of Bioformer, as well as existing biomedical BERT models including BioBERT and PubMedBERT, on 15 benchmark datasets covering four biomedical NLP tasks: named entity recognition, relation extraction, question answering and document classification. The results show that with 60% fewer parameters, Bioformer16L is only 0.1% less accurate than PubMedBERT, while Bioformer8L is 0.9% less accurate than PubMedBERT. Both Bioformer16L and Bioformer8L outperformed BioBERT. In addition, Bioformer16L and Bioformer8L are 2-3 fold as fast as PubMedBERT/BioBERT. Bioformer has been successfully deployed to PubTator Central, providing gene annotations for over 35 million PubMed abstracts and 5 million PubMed Central full-text articles. We make Bioformer publicly available via https://github.com/WGLab/bioformer, including pre-trained models, datasets, and instructions for downstream use.
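Because Bioformer follows the standard BERT architecture, a released checkpoint can be loaded with the Hugging Face transformers library. The sketch below shows masked-token inference; the hub identifier "bioformers/bioformer-16L" and the example sentence are assumptions not stated in the abstract (consult the GitHub repository for the officially distributed checkpoints).

```python
# Minimal sketch: masked-token inference with a Bioformer checkpoint.
# NOTE: the model ID below is an assumption; see https://github.com/WGLab/bioformer
# for the official pre-trained models.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "bioformers/bioformer-16L"  # assumed Hugging Face hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

text = "The BRCA1 gene is associated with an increased risk of breast [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and report the highest-scoring token.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

The same checkpoint can be fine-tuned for the downstream tasks evaluated in the paper (e.g., named entity recognition via a token-classification head) using the standard transformers fine-tuning workflow.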

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ef4c/10029052/617f9f75f9b9/nihpp-2302.01588v1-f0001.jpg
