
On the effectiveness of compact biomedical transformers.

Affiliations

Department of Engineering Science, University of Oxford, Oxford, UK.

NLPie Research, Oxford, UK.

Publication information

Bioinformatics. 2023 Mar 1;39(3). doi: 10.1093/bioinformatics/btad103.

DOI: 10.1093/bioinformatics/btad103
PMID: 36825820
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10027428/
Abstract

MOTIVATION

Language models pre-trained on biomedical corpora, such as BioBERT, have recently shown promising results on downstream biomedical tasks. Many existing pre-trained models, on the other hand, are resource-intensive and computationally heavy owing to factors such as embedding size, hidden dimension and number of layers. The natural language processing community has developed numerous strategies to compress these models utilizing techniques such as pruning, quantization and knowledge distillation, resulting in models that are considerably faster, smaller and subsequently easier to use in practice. By the same token, in this article, we introduce six lightweight models, namely, BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT, TinyBioBERT and CompactBioBERT, which are obtained either by knowledge distillation from a biomedical teacher or continual learning on the PubMed dataset. We evaluate all of our models on three biomedical tasks and compare them with BioBERT-v1.1 to create the best efficient lightweight models that perform on par with their larger counterparts.
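To make the distillation idea above concrete, here is a minimal, hedged sketch of an output-level distillation objective in PyTorch with the Hugging Face transformers library. It is an illustration rather than the authors' training code: the dmis-lab/biobert-v1.1 teacher matches the baseline compared against in the paper, but the 6-layer student, the temperature and the loss weighting are assumptions made for this example, and the released models reuse DistilBERT/TinyBERT/MobileBERT architectures with additional distillation terms not shown here.

```python
# Minimal sketch of output-level knowledge distillation for masked language modelling.
# NOT the authors' exact recipe: the 6-layer student, temperature and loss weights are
# illustrative; the released models reuse DistilBERT/TinyBERT/MobileBERT architectures
# and add further distillation terms (e.g. layer and attention losses) not shown here.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM, BertConfig, BertForMaskedLM

teacher_id = "dmis-lab/biobert-v1.1"  # BioBERT-v1.1, the baseline compared against in the paper
tokenizer = AutoTokenizer.from_pretrained(teacher_id)  # used to build masked PubMed batches (not shown)
teacher = AutoModelForMaskedLM.from_pretrained(teacher_id).eval()
# If this checkpoint ships without an MLM head, transformers initialises one randomly and warns;
# a real run needs a teacher checkpoint that includes its language-modelling head.

# Illustrative small student that shares the teacher's vocabulary.
student = BertForMaskedLM(BertConfig(
    vocab_size=teacher.config.vocab_size,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=3072,
))

def distillation_loss(input_ids, attention_mask, labels, temperature=2.0, alpha=0.5):
    """Soft-target KL term from the teacher plus the student's hard-label MLM loss."""
    with torch.no_grad():
        t_logits = teacher(input_ids=input_ids, attention_mask=attention_mask).logits
    s_out = student(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    soft = F.kl_div(
        F.log_softmax(s_out.logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * soft + (1.0 - alpha) * s_out.loss
```

For the continual-learning variants, the same masked-language-modelling objective would simply be continued on PubMed text without the teacher term.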

RESULTS

We trained six different models in total, with the largest model having 65 million parameters and the smallest having 15 million; a far lower range of parameters compared with BioBERT's 110M. Based on our experiments on three different biomedical tasks, we found that models distilled from a biomedical teacher and models that have been additionally pre-trained on the PubMed dataset can retain up to 98.8% and 98.6% of the performance of BioBERT-v1.1, respectively. Overall, our best model below 30M parameters is BioMobileBERT, while our best models over 30M parameters are DistilBioBERT and CompactBioBERT, which can keep up to 98.2% and 98.8% of the performance of BioBERT-v1.1, respectively.
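As a quick sanity check on the parameter figures quoted above, a checkpoint's size can be counted directly with the transformers library. This is a generic sketch; the exact number depends on whether embeddings and task heads are included.

```python
# Rough parameter count of a Hugging Face checkpoint, in millions.
from transformers import AutoModel

def millions_of_parameters(model_id: str) -> float:
    model = AutoModel.from_pretrained(model_id)
    return sum(p.numel() for p in model.parameters()) / 1e6

# BioBERT-v1.1 encoder: expect roughly 108-110M parameters.
print(millions_of_parameters("dmis-lab/biobert-v1.1"))
```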

AVAILABILITY AND IMPLEMENTATION

Codes are available at: https://github.com/nlpie-research/Compact-Biomedical-Transformers. Trained models can be accessed at: https://huggingface.co/nlpie.
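The released checkpoints can be loaded directly with the transformers library. The snippet below is a usage sketch: nlpie/distil-biobert is an assumed example id, and the exact repository names should be taken from the hub page above.

```python
# Loading one of the compact models for feature extraction.
# "nlpie/distil-biobert" is an assumed example id; confirm the name at https://huggingface.co/nlpie.
from transformers import AutoTokenizer, AutoModel

model_id = "nlpie/distil-biobert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("EGFR mutations are common in lung adenocarcinoma.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```

For downstream tasks such as named-entity recognition, the same checkpoint would be loaded with the matching AutoModelForTokenClassification head and fine-tuned, as in the paper's evaluations.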


Figures (PMC full text):
Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6bd1/10027428/1c6c312eb784/btad103f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6bd1/10027428/be9b2d6f182f/btad103f2.jpg

Similar articles

1. On the effectiveness of compact biomedical transformers.
Bioinformatics. 2023 Mar 1;39(3). doi: 10.1093/bioinformatics/btad103.
2. Comparison of Two Modern Survival Prediction Tools, SORG-MLA and METSSS, in Patients With Symptomatic Long-bone Metastases Who Underwent Local Treatment With Surgery Followed by Radiotherapy and With Radiotherapy Alone.
Clin Orthop Relat Res. 2024 Dec 1;482(12):2193-2208. doi: 10.1097/CORR.0000000000003185. Epub 2024 Jul 23.
3. Factors that impact on the use of mechanical ventilation weaning protocols in critically ill adults and children: a qualitative evidence-synthesis.
Cochrane Database Syst Rev. 2016 Oct 4;10(10):CD011812. doi: 10.1002/14651858.CD011812.pub2.
4. Signs and symptoms to determine if a patient presenting in primary care or hospital outpatient settings has COVID-19.
Cochrane Database Syst Rev. 2022 May 20;5(5):CD013665. doi: 10.1002/14651858.CD013665.pub3.
5. Surveillance of Barrett's oesophagus: exploring the uncertainty through systematic review, expert workshop and economic modelling.
Health Technol Assess. 2006 Mar;10(8):1-142, iii-iv. doi: 10.3310/hta10080.
6. Systemic pharmacological treatments for chronic plaque psoriasis: a network meta-analysis.
Cochrane Database Syst Rev. 2021 Apr 19;4(4):CD011535. doi: 10.1002/14651858.CD011535.pub4.
7. Short-Term Memory Impairment
8. Sexual Harassment and Prevention Training
9. Behavioral interventions to reduce risk for sexual transmission of HIV among men who have sex with men.
Cochrane Database Syst Rev. 2008 Jul 16;(3):CD001230. doi: 10.1002/14651858.CD001230.pub2.
10. Predicting Drug-Side Effect Relationships From Parametric Knowledge Embedded in Biomedical BERT Models: Methodological Study With a Natural Language Processing Approach.
JMIR Med Inform. 2025 Jul 10;13:e67513. doi: 10.2196/67513.

Cited by

1. Automated Data Harmonization in Clinical Research: Natural Language Processing Approach.
JMIR Form Res. 2025 Aug 27;9:e75608. doi: 10.2196/75608.
2. InertDB as a generative AI-expanded resource of biologically inactive small molecules from PubChem.
J Cheminform. 2025 Apr 10;17(1):49. doi: 10.1186/s13321-025-00999-1.
3. Year 2023 in Biomedical Natural Language Processing: a Tribute to Large Language Models and Generative AI.
Yearb Med Inform. 2024 Aug;33(1):241-248. doi: 10.1055/s-0044-1800751. Epub 2025 Apr 8.

4. PromptLink: Leveraging Large Language Models for Cross-Source Biomedical Concept Linking.
Int ACM SIGIR Conf Res Dev Inf Retr. 2024 Jul;2024:2589-2593. doi: 10.1145/3626772.3657904. Epub 2024 Jul 11.
5. PretoxTM: a text mining system for extracting treatment-related findings from preclinical toxicology reports.
J Cheminform. 2025 Feb 3;17(1):15. doi: 10.1186/s13321-024-00925-x.
6. Needle in a haystack: Harnessing AI in drug patent searches and prediction.
PLoS One. 2024 Dec 2;19(12):e0311238. doi: 10.1371/journal.pone.0311238. eCollection 2024.
7. Lightweight transformers for clinical natural language processing.
Nat Lang Eng. 2024 Sep;30(5):887-914. doi: 10.1017/S1351324923000542. Epub 2024 Jan 12.
8. Exploring the effectiveness of instruction tuning in biomedical language processing.
Artif Intell Med. 2024 Dec;158:103007. doi: 10.1016/j.artmed.2024.103007. Epub 2024 Nov 7.
9. A scoping review of large language model based approaches for information extraction from radiology reports.
NPJ Digit Med. 2024 Aug 24;7(1):222. doi: 10.1038/s41746-024-01219-0.
10. Biomedical named entity recognition based on multi-cross attention feature fusion.
PLoS One. 2024 May 28;19(5):e0304329. doi: 10.1371/journal.pone.0304329. eCollection 2024.

References

1. SECNLP: A survey of embeddings in clinical natural language processing.
J Biomed Inform. 2020 Jan;101:103323. doi: 10.1016/j.jbi.2019.103323. Epub 2019 Nov 8.
2. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
3. Continual lifelong learning with neural networks: A review.
Neural Netw. 2019 May;113:54-71. doi: 10.1016/j.neunet.2019.01.012. Epub 2019 Feb 6.
4. BioCreative V CDR task corpus: a resource for chemical disease relation extraction.
Database (Oxford). 2016 May 9;2016. doi: 10.1093/database/baw068. Print 2016.
5. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition.
BMC Bioinformatics. 2015 Apr 30;16:138. doi: 10.1186/s12859-015-0564-6.
6. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research.
BMC Bioinformatics. 2015 Feb 21;16:55. doi: 10.1186/s12859-015-0472-9.
7. The CHEMDNER corpus of chemicals and drugs and its annotation principles.
J Cheminform. 2015 Jan 19;7(Suppl 1):S2. doi: 10.1186/1758-2946-7-S1-S2. eCollection 2015.
8. NCBI disease corpus: a resource for disease name recognition and concept normalization.
J Biomed Inform. 2014 Feb;47:1-10. doi: 10.1016/j.jbi.2013.12.006. Epub 2014 Jan 3.
9. The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text.
PLoS One. 2013 Jun 18;8(6):e65390. doi: 10.1371/journal.pone.0065390. Print 2013.
10. LINNAEUS: a species name identification system for biomedical literature.
BMC Bioinformatics. 2010 Feb 11;11:85. doi: 10.1186/1471-2105-11-85.