Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, PA, USA.
College of Computing and Informatics, Drexel University, Philadelphia, PA, USA.
Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad226.
Human prescription drug labeling contains a summary of the essential scientific information needed for the safe and effective use of a drug, including the Prescribing Information, FDA-approved patient labeling (Medication Guides, Patient Package Inserts and/or Instructions for Use) and/or carton and container labeling. Drug labels carry critical information about drug products, such as pharmacokinetics and adverse events. Automatic information extraction from drug labels can therefore help identify a drug's adverse reactions or its interactions with other drugs. Natural language processing (NLP) techniques, especially the recently developed Bidirectional Encoder Representations from Transformers (BERT), have shown exceptional merit in text-based information extraction. A common paradigm for training BERT is to pretrain the model on large unlabeled generic-language corpora, so that it learns the distribution of words in the language, and then fine-tune it on a downstream task. In this paper, we first show that the language used in drug labels is distinctive and therefore cannot be handled optimally by other BERT models. We then present PharmBERT, a BERT model pretrained specifically on drug labels (publicly available at Hugging Face). We demonstrate that our model outperforms vanilla BERT, ClinicalBERT and BioBERT on multiple NLP tasks in the drug-label domain. Finally, by analyzing PharmBERT's individual layers, we show how domain-specific pretraining contributes to its superior performance and gain insight into how the model captures different linguistic aspects of the data.
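The vocabulary mismatch motivating domain-specific pretraining can be illustrated with a toy greedy longest-match-first subword tokenizer in the style of BERT's WordPiece. This is a simplified sketch, not the paper's method: both vocabularies below are invented for illustration, whereas real BERT vocabularies contain roughly 30,000 entries learned from the pretraining corpus.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization (WordPiece-style).

    Repeatedly takes the longest prefix of the remaining characters that
    appears in the vocabulary; non-initial pieces are prefixed with '##'.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are marked with '##'
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate until it matches
        if piece is None:
            return ["[UNK]"]  # no subword covers this position
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabularies for illustration only.
generic_vocab = {"pharma", "##co", "##kin", "##et", "##ics", "drug"}
domain_vocab = {"pharmacokinetics", "drug"}

print(wordpiece_tokenize("pharmacokinetics", generic_vocab))
# → ['pharma', '##co', '##kin', '##et', '##ics']
print(wordpiece_tokenize("pharmacokinetics", domain_vocab))
# → ['pharmacokinetics']
```

A generic vocabulary fragments the domain term into five pieces, forcing the model to reassemble its meaning across subwords, while a vocabulary drawn from drug-label text keeps it as a single token with its own learned embedding.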