Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR.
School of Artificial Intelligence, Jilin University, China.
Brief Bioinform. 2022 Jan 17;23(1). doi: 10.1093/bib/bbab451.
The rapidly growing biomedical literature contains diverse and comprehensive knowledge that remains to be mined, such as drug interactions. However, it is difficult to extract this heterogeneous knowledge so that the latest and novel findings can be retrieved, or even discovered, efficiently. To address this problem, we propose EGFI for extracting and consolidating drug interactions from large-scale medical literature text data. Specifically, EGFI consists of two parts: classification and generation. In the classification part, EGFI builds on the language model BioBERT, which has been comprehensively pretrained on biomedical corpora. In particular, we propose a multihead self-attention mechanism and a packed BiGRU to fuse multiple sources of semantic information for rigorous context modeling. In the generation part, EGFI utilizes another pretrained language model, BioGPT-2, and the generated sentences are selected based on filtering rules.
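The abstract does not spell out the filtering rules, so the following is only a minimal sketch of what rule-based selection of generated sentences could look like. Both rules (a minimum token count and exactly two distinct drug mentions from a given vocabulary), the function names, and the example sentences are hypothetical illustrations, not the rules used in EGFI.

```python
import re


def mentioned_drugs(sentence, drug_vocab):
    """Return the drugs from drug_vocab that occur as whole tokens in the sentence."""
    tokens = set(re.findall(r"[a-z0-9\-]+", sentence.lower()))
    return {drug for drug in drug_vocab if drug in tokens}


def keep_sentence(sentence, drug_vocab, min_tokens=5):
    """Hypothetical rule set: long enough and naming exactly two distinct drugs."""
    if len(sentence.split()) < min_tokens:
        return False
    return len(mentioned_drugs(sentence, drug_vocab)) == 2


def filter_generated(sentences, drug_vocab, min_tokens=5):
    """Keep only the generated sentences that pass the assumed rules."""
    return [s for s in sentences if keep_sentence(s, drug_vocab, min_tokens)]


# Illustrative inputs only; these are not outputs of the actual model.
vocab = {"warfarin", "aspirin", "ibuprofen"}
generated = [
    "Warfarin may increase the anticoagulant effect of aspirin.",
    "Aspirin is commonly used to relieve mild pain.",
    "Ibuprofen interacts.",
]
kept = filter_generated(generated, vocab)
```

In a pipeline like the one described, sentences surviving such a filter would then be passed to the classification part for scoring.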
We evaluated the classification part on the 'DDIs 2013' dataset and the 'DTIs' dataset, achieving F1 scores of 0.842 and 0.720, respectively. Moreover, we applied the classification part to identify high-quality generated sentences and verified the filtered sentences against the existing ground truth. The generated sentences that are not recorded in DrugBank or the DDIs 2013 dataset demonstrate the potential of EGFI to identify novel drug relationships.
Source code is publicly available at https://github.com/Layne-Huang/EGFI.
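For readers unfamiliar with the metric, the sketch below shows one way an F1 score for relation classification can be computed. It assumes micro-averaging over the positive relation classes with the negative class excluded, which is a common convention for DDI extraction benchmarks; the abstract does not state which averaging was used, and the label names and toy predictions are illustrative only.

```python
def micro_f1(gold, pred, negative_label="none"):
    """Micro-averaged F1 over all labels except negative_label (assumed convention)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != negative_label:
            if p == g:
                tp += 1  # predicted a positive class and it was correct
            else:
                fp += 1  # predicted a positive class that was wrong
        if g != negative_label and p != g:
            fn += 1  # a gold positive instance was missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration; not data from the paper.
gold = ["effect", "none", "advise", "mechanism", "none"]
pred = ["effect", "advise", "advise", "none", "none"]
score = micro_f1(gold, pred)
```

Here two of the three positive predictions are correct and two of the three gold positives are recovered, so precision and recall are both 2/3.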