Department of Biomedical Informatics, Ohio State University, Columbus, OH, 43210, USA.
J Biomed Semantics. 2023 May 30;14(1):5. doi: 10.1186/s13326-023-00287-7.
Drug-drug interaction (DDI) information retrieval (IR) from the PubMed literature is an important natural language processing (NLP) task. This study is the first to apply active learning (AL) to DDI IR analysis. DDI IR analysis of PubMed abstracts faces the challenge of severe class imbalance: positive DDI samples are relatively few among an overwhelmingly large number of negative samples. Random negative sampling and positive sampling are purposely designed to improve the efficiency of AL analysis under this imbalance. The paper also shows the consistency of random negative sampling and positive sampling.
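The abstract does not spell out how the sampling schemes are implemented, so the following is only a minimal sketch under stated assumptions. It assumes the classifier outputs a probability per abstract, that uncertainty sampling picks the abstracts scored closest to 0.5, and that random negative sampling draws abstracts at random from the pool and treats them as presumed negatives because positives are so rare. The function names `uncertainty_sample` and `random_negative_sample` are hypothetical and do not come from the paper.

```python
import random


def uncertainty_sample(scores, k):
    """Return indices of the k items whose score is closest to 0.5.

    Assumption: scores are predicted probabilities of the positive class.
    """
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    return ranked[:k]


def random_negative_sample(pool_size, k, exclude=(), seed=0):
    """Draw k pool indices at random to use as presumed negatives.

    Assumption: positives are rare enough in the pool that a random
    draw is almost always a true negative. Indices in `exclude`
    (for example, already labeled items) are never drawn.
    """
    excluded = set(exclude)
    candidates = [i for i in range(pool_size) if i not in excluded]
    rng = random.Random(seed)
    return rng.sample(candidates, k)


# One illustrative selection round on toy scores.
toy_scores = [0.10, 0.48, 0.90, 0.55, 0.30]
to_label = uncertainty_sample(toy_scores, 2)
presumed_negatives = random_negative_sample(len(toy_scores), 2, exclude=to_label)
```

In this sketch the uncertain items would go to a human annotator, while the randomly drawn items would enter the training set as negatives without manual review. Whether that matches the paper's exact procedure is not stated in the abstract.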
PubMed abstracts are divided into two pools. The screened pool contains all abstracts that pass the DDI keyword query in PubMed, while the unscreened pool includes all other abstracts. DDI IR precision is evaluated and compared at a prespecified recall rate of 0.95. In the screened pool IR analysis using a support vector machine (SVM), similarity sampling plus uncertainty sampling improves precision over uncertainty sampling alone, from 0.89 to 0.92. In the unscreened pool IR analysis, integrating random negative sampling, positive sampling, and similarity sampling improves precision over uncertainty sampling alone, from 0.72 to 0.81. When the SVM is replaced with a deep learning method, all sampling schemes consistently improve DDI AL analysis in both the screened and unscreened pools. Deep learning significantly improves precision over SVM: 0.96 vs. 0.92 in the screened pool, and 0.90 vs. 0.81 in the unscreened pool.
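The evaluation metric, precision at a fixed recall of 0.95, can be illustrated with a short sketch. It assumes binary labels (1 for a DDI abstract, 0 otherwise) and one classifier score per abstract, and it reports precision at the first ranking cutoff where recall reaches the target. The function name `precision_at_recall` is hypothetical, and the paper's exact thresholding procedure is not given in the abstract.

```python
def precision_at_recall(labels, scores, target_recall=0.95):
    """Precision at the first cutoff whose recall reaches target_recall.

    Items are ranked by descending score, and the cutoff moves down the
    ranking one item at a time. Ties in score are not treated specially.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            return true_pos / k
    # Reached only if target_recall > 1, which no cutoff can satisfy.
    raise ValueError("target_recall must not exceed 1.0")


# Toy example: 3 positives among 5 items.
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.1]
full_recall_precision = precision_at_recall(toy_labels, toy_scores, 1.0)
```

In the toy example, all three positives are recovered only after the top four items are retrieved, so precision at full recall is 3/4.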
Integrating various sampling schemes and deep learning algorithms into AL significantly improves DDI IR analysis from the literature. Random negative sampling and positive sampling are highly effective methods for improving AL analysis when the positive and negative samples are extremely imbalanced.