BERT-Based Natural Language Processing of Drug Labeling Documents: A Case Study for Classifying Drug-Induced Liver Injury Risk.

Authors

Wu Yue, Liu Zhichao, Wu Leihong, Chen Minjun, Tong Weida

Affiliation

Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, United States Food and Drug Administration, Jefferson, AR, United States.

Publication information

Front Artif Intell. 2021 Dec 6;4:729834. doi: 10.3389/frai.2021.729834. eCollection 2021.

DOI: 10.3389/frai.2021.729834

PMID: 34939028

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8685544/

Abstract

The United States Food and Drug Administration (FDA) regulates a broad range of consumer products, which account for about 25% of the United States market. FDA regulatory activities often involve producing and reading large numbers of documents, which is time-consuming and labor-intensive. To support regulatory science at the FDA, we evaluated artificial intelligence (AI)-based natural language processing (NLP) of regulatory documents for text classification and compared deep learning-based models with a conventional keyword-based model. FDA drug labeling documents were used as a representative regulatory data source to classify drug-induced liver injury (DILI) risk by employing the state-of-the-art language model BERT. The resulting NLP-DILI classification model was statistically validated with both internal and external validation procedures and applied to labeling data from the European Medicines Agency (EMA) to test cross-agency application. The NLP-DILI model, developed using FDA labeling documents and evaluated by cross-validation, performed strongly in DILI classification, with a recall of 1 and a precision of 0.78. When cross-agency data were used to validate the model, performance remained comparable, demonstrating that the model was portable across agencies. Results also suggested that the model was able to capture the semantic meaning of sentences in drug labeling. Deep learning-based NLP models performed well in DILI classification of drug labeling documents and learned the meaning of complex text in drug labeling. This proof-of-concept work demonstrated that using AI technologies to assist regulatory activities is a promising approach to modernizing and advancing regulatory science.
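To make the reported metrics concrete, the following minimal sketch shows how precision and recall are computed from confusion-matrix counts. The function name `precision_recall` and the counts used below are hypothetical and chosen only so that the arithmetic reproduces a recall of 1 and a precision of 0.78; they are not the counts from the study, and this is not the authors' evaluation code.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from confusion-matrix counts.

    tp: documents correctly flagged as DILI-positive
    fp: documents flagged as DILI-positive that are actually negative
    fn: DILI-positive documents the model missed
    """
    # Precision: share of flagged documents that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of truly positive documents that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts (an assumption for illustration, not study data):
# no positives are missed, but 22 of 100 flagged documents are false alarms.
precision, recall = precision_recall(tp=78, fp=22, fn=0)
print(f"precision={precision:.2f}, recall={recall:.2f}")
# prints "precision=0.78, recall=1.00"
```

Read this way, a recall of 1 means no DILI-positive label was missed, while a precision of 0.78 means roughly one in five flagged labels would be a false positive.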

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2aa2/8685544/f1a9ed202696/frai-04-729834-g001.jpg

Similar articles

RxBERT: Enhancing drug labeling text mining and analysis with AI language modeling.

Exp Biol Med (Maywood). 2023 Nov;248(21):1937-1943. doi: 10.1177/15353702231220669. Epub 2024 Jan 2.

Classifying Free Texts Into Predefined Sections Using AI in Regulatory Documents: A Case Study with Drug Labeling Documents.

Chem Res Toxicol. 2023 Aug 21;36(8):1290-1299. doi: 10.1021/acs.chemrestox.3c00028. Epub 2023 Jul 24.

Fine-tuning BERT for automatic ADME semantic labeling in FDA drug labeling to enhance product-specific guidance assessment.

J Biomed Inform. 2023 Feb;138:104285. doi: 10.1016/j.jbi.2023.104285. Epub 2023 Jan 9.

DILI : An AI-Based Classifier to Search for Drug-Induced Liver Injury Literature.

Front Genet. 2022 Jun 29;13:867946. doi: 10.3389/fgene.2022.867946. eCollection 2022.

Automatic text classification of drug-induced liver injury using document-term matrix and XGBoost.

Front Artif Intell. 2024 Jun 3;7:1401810. doi: 10.3389/frai.2024.1401810. eCollection 2024.

When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification.

BMC Med Inform Decis Mak. 2022 Apr 5;21(Suppl 9):377. doi: 10.1186/s12911-022-01829-2.

Toward an Explainable Large Language Model for the Automatic Identification of the Drug-Induced Liver Injury Literature.

Chem Res Toxicol. 2024 Sep 16;37(9):1524-1534. doi: 10.1021/acs.chemrestox.4c00134. Epub 2024 Aug 27.

Information Extraction From FDA Drug Labeling to Enhance Product-Specific Guidance Assessment Using Natural Language Processing.

Front Res Metr Anal. 2021 Jun 10;6:670006. doi: 10.3389/frma.2021.670006. eCollection 2021.

Evaluation of Natural Language Processing (NLP) systems to annotate drug product labeling with MedDRA terminology.

J Biomed Inform. 2018 Jul;83:73-86. doi: 10.1016/j.jbi.2018.05.019. Epub 2018 Jun 1.

Cited by

Comparative analysis of natural language processing methodologies for classifying computed tomography enterography reports in Crohn's disease patients.

NPJ Digit Med. 2025 May 30;8(1):324. doi: 10.1038/s41746-025-01729-5.

Leveraging FDA Labeling Documents and Large Language Model to Enhance Annotation, Profiling, and Classification of Drug Adverse Events with AskFDALabel.

Drug Saf. 2025 Jun;48(6):655-665. doi: 10.1007/s40264-025-01520-1. Epub 2025 Feb 20.

Innovation and challenges of artificial intelligence technology in personalized healthcare.

Sci Rep. 2024 Aug 16;14(1):18994. doi: 10.1038/s41598-024-70073-7.

Automatic text classification of drug-induced liver injury using document-term matrix and XGBoost.

Front Artif Intell. 2024 Jun 3;7:1401810. doi: 10.3389/frai.2024.1401810. eCollection 2024.

PolyAMiner-Bulk is a deep learning-based algorithm that decodes alternative polyadenylation dynamics from bulk RNA-seq data.

Cell Rep Methods. 2024 Feb 26;4(2):100707. doi: 10.1016/j.crmeth.2024.100707. Epub 2024 Feb 6.

Computational models for predicting liver toxicity in the deep learning era.

Front Toxicol. 2024 Jan 19;5:1340860. doi: 10.3389/ftox.2023.1340860. eCollection 2023.

RxBERT: Enhancing drug labeling text mining and analysis with AI language modeling.

Exp Biol Med (Maywood). 2023 Nov;248(21):1937-1943. doi: 10.1177/15353702231220669. Epub 2024 Jan 2.

dialogi: Utilising NLP With Chemical and Disease Similarities to Drive the Identification of Drug-Induced Liver Injury Literature.

Front Genet. 2022 Aug 9;13:894209. doi: 10.3389/fgene.2022.894209. eCollection 2022.

NeuroCORD: A Language Model to Facilitate COVID-19-Associated Neurological Disorder Studies.

Int J Environ Res Public Health. 2022 Aug 12;19(16):9974. doi: 10.3390/ijerph19169974.

DILI : An AI-Based Classifier to Search for Drug-Induced Liver Injury Literature.

Front Genet. 2022 Jun 29;13:867946. doi: 10.3389/fgene.2022.867946. eCollection 2022.
