Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, PA, USA.
College of Computing and Informatics, Drexel University, Philadelphia, PA, USA.
Brief Bioinform. 2023 Jul 20;24(4). doi: 10.1093/bib/bbad226.
Human prescription drug labeling contains a summary of the essential scientific information needed for the safe and effective use of a drug, including the Prescribing Information, FDA-approved patient labeling (Medication Guides, Patient Package Inserts and/or Instructions for Use) and/or carton and container labeling. Drug labels carry critical information about drug products, such as pharmacokinetics and adverse events. Automatic information extraction from drug labels can therefore help identify a drug's adverse reactions or its interactions with other drugs. Natural language processing (NLP) techniques, especially the recently developed Bidirectional Encoder Representations from Transformers (BERT), have shown exceptional merit in text-based information extraction. A common paradigm for training BERT is to pretrain the model on large unlabeled generic-language corpora, so that it learns the distribution of words in the language, and then fine-tune it on a downstream task. In this paper, we first show that the language used in drug labels is distinctive and therefore cannot be handled optimally by other BERT models. We then present PharmBERT, a BERT model pretrained specifically on drug labels (publicly available at Hugging Face). We demonstrate that our model outperforms vanilla BERT, ClinicalBERT and BioBERT on multiple NLP tasks in the drug-label domain. Finally, by analyzing PharmBERT's individual layers, we show how domain-specific pretraining contributes to its superior performance and gain insight into how the model captures different linguistic aspects of the data.
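The vocabulary mismatch motivating domain-specific pretraining can be illustrated with a toy greedy longest-match-first subword tokenizer in the style of BERT's WordPiece. This is a simplified sketch, not the paper's method: both vocabularies below are invented for illustration, whereas real BERT vocabularies contain roughly 30,000 entries learned from the pretraining corpus.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization (WordPiece-style).

    Repeatedly takes the longest prefix of the remaining characters that
    appears in the vocabulary; non-initial pieces are prefixed with '##'.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are marked with '##'
            if sub in vocab:
                piece = sub
                break
            end -= 1  # shrink the candidate until it matches
        if piece is None:
            return ["[UNK]"]  # no subword covers this position
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabularies for illustration only.
generic_vocab = {"pharma", "##co", "##kin", "##et", "##ics", "drug"}
domain_vocab = {"pharmacokinetics", "drug"}

print(wordpiece_tokenize("pharmacokinetics", generic_vocab))
# → ['pharma', '##co', '##kin', '##et', '##ics']
print(wordpiece_tokenize("pharmacokinetics", domain_vocab))
# → ['pharmacokinetics']
```

A generic vocabulary fragments the domain term into five pieces, forcing the model to reassemble its meaning across subwords, while a vocabulary drawn from drug-label text keeps it as a single token with its own learned embedding.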