Bhatt Arjun, Roberts Ruth, Chen Xi, Li Ting, Connor Skylar, Hatim Qais, Mikailov Mike, Tong Weida, Liu Zhichao
Division of Bioinformatics & Biostatistics, National Center for Toxicological Research, Food and Drug Administration, Jefferson, AR, United States.
Dartmouth College, Hanover, NH, United States.
Front Artif Intell. 2021 Aug 2;4:711467. doi: 10.3389/frai.2021.711467. eCollection 2021.
Drug labeling contains an 'INDICATIONS AND USAGE' section that provides vital information to support clinical decision making and regulatory management. Effective extraction of drug indication information from free-text resources could facilitate drug repositioning projects and help collect real-world evidence in support of secondary use of approved medicines. To enable AI-powered language models for the extraction of drug indication information, we used manual reading and curation to develop a Drug Indication Classification and Encyclopedia (DICE) based on FDA-approved human prescription drug labeling. A DICE scheme with 7,231 sentences categorized into five classes (indications, contraindications, side effects, usage instructions, and clinical observations) was developed. To further elucidate the utility of DICE, we developed nine different AI-based classifiers for the prediction of indications from DICE and comprehensively assessed their performance. We found that the transformer-based language models yielded an average Matthews correlation coefficient (MCC) of 0.887, outperforming the word embedding-based bidirectional long short-term memory (BiLSTM) models (0.862) with a 2.82% improvement on the test set. The best classifiers were also used to extract drug indication information from DrugBank and achieved a high enrichment rate (>0.930) for this task. We found that domain-specific training could provide more explainable models without sacrificing performance, along with better generalization to external validation datasets. Altogether, the proposed DICE could be a standard resource for the development and evaluation of task-specific, AI-powered natural language processing (NLP) models.
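The MCC values quoted above summarize a binary confusion matrix (indication versus non-indication) in a single score between -1 and 1. The following is a minimal sketch of that calculation, not the authors' evaluation code: the function names `matthews_corrcoef` and `relative_improvement` and the toy labels are illustrative assumptions, and the relative-improvement formula is one common reading of a percentage improvement.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC from paired 0/1 labels (1 = indication sentence, an assumed encoding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: MCC is reported as 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Toy labels, invented for illustration only (not DICE data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(matthews_corrcoef(y_true, y_pred))  # 0.5
print(round(relative_improvement(0.887, 0.862), 2))
```

Applied to the rounded averages 0.887 and 0.862, this formula gives roughly 2.9%, so the reported 2.82% presumably reflects unrounded scores or a different averaging; that reading is an assumption.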