Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT.

Affiliations

School of Computer Science, The University of Sydney, Sydney, Australia.

Biomedical Informatics and Digital Health and Faculty of Medicine and Health, School of Medical Sciences, The University of Sydney, Sydney, Australia.

Publication information

BMC Bioinformatics. 2022 Apr 21;23(1):144. doi: 10.1186/s12859-022-04688-w.

DOI:10.1186/s12859-022-04688-w

PMID:35448946

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9022356/

Abstract

BACKGROUND

The abundance of biomedical text data, coupled with advances in natural language processing (NLP), is resulting in novel biomedical NLP (BioNLP) applications. These NLP applications, or tasks, rely on the availability of domain-specific language models (LMs) that are trained on a massive amount of data. Most existing domain-specific LMs adopt the bidirectional encoder representations from transformers (BERT) architecture, which has limitations, and their generalizability is unproven because baseline results across common BioNLP tasks are absent.

RESULTS

We present 8 variants of BioALBERT, a domain-specific adaptation of a lite bidirectional encoder representations from transformers (ALBERT), trained on biomedical (PubMed and PubMed Central) and clinical (MIMIC-III) corpora and fine-tuned for 6 different tasks across 20 benchmark datasets. Experiments show that a large variant of BioALBERT trained on PubMed outperforms the state-of-the-art on named-entity recognition (+ 11.09% BLURB score improvement), relation extraction (+ 0.80% BLURB score), sentence similarity (+ 1.05% BLURB score), document classification (+ 0.62% F1-score), and question answering (+ 2.83% BLURB score). It represents a new state-of-the-art in 5 out of 6 benchmark BioNLP tasks.

CONCLUSIONS

The large variant of BioALBERT trained on PubMed achieved a higher BLURB score than previous state-of-the-art models on 5 of the 6 benchmark BioNLP tasks. Depending on the task, 5 different variants of BioALBERT outperformed previous state-of-the-art models on 17 of the 20 benchmark datasets, showing that our model is robust and generalizable in the common BioNLP tasks. We have made BioALBERT freely available, which will help the BioNLP community avoid the computational cost of training and establish a new set of baselines for future efforts across a broad range of BioNLP tasks.
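The abstract reports task-level BLURB scores without spelling out how they are aggregated. The sketch below is one plausible reading, assuming (as in the BLURB benchmark's macro-averaging convention) that a task score is the mean of its dataset-level metrics and that an overall score is the mean of the task scores. The function names, dataset names, and numbers are hypothetical placeholders for illustration, not results from the paper.

```python
from statistics import mean

# Hypothetical per-dataset metrics (e.g. F1 in percent), grouped by task.
# These values are placeholders, NOT figures reported in the paper.
results = {
    "ner": {"dataset_a": 80.0, "dataset_b": 90.0},
    "qa": {"dataset_c": 70.0},
}


def task_score(dataset_metrics: dict[str, float]) -> float:
    """Average the dataset-level metrics within one task."""
    return mean(dataset_metrics.values())


def overall_score(task_results: dict[str, dict[str, float]]) -> float:
    """Macro-average: every task counts equally, however many datasets it has."""
    return mean(task_score(metrics) for metrics in task_results.values())


def improvement(new_score: float, baseline_score: float) -> float:
    """Absolute difference in percentage points between two scores."""
    return new_score - baseline_score


if __name__ == "__main__":
    print(f"NER task score: {task_score(results['ner']):.2f}")
    print(f"Overall macro-average: {overall_score(results):.2f}")
```

Under this assumption, a task with many datasets does not dominate the overall score, which is why a macro-average can differ from a plain mean over all datasets.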

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b79/9022356/ff9cf25635a3/12859_2022_4688_Fig1_HTML.jpg
