Xie Weixin, Wang Limei, Cheng Qi, Wang Xueying, Wang Ying, Bi Hongyuan, He Bo, Feng Weixing
Institute of Intelligent System and Bioinformatics, College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin, China.
Key Laboratory of Tropical Translational Medicine of Ministry of Education, College of Biomedical Information and Engineering, Hainan Medical University, Haikou, China.
Front Pharmacol. 2021 Apr 23;11:582470. doi: 10.3389/fphar.2020.582470. eCollection 2020.
Clinical drug-drug interactions (DDIs) are a major cause of both medical errors and adverse drug events (ADEs). The published literature on DDI clinical toxicity continues to grow significantly, and high-performance DDI information retrieval (IR) text mining methods are in high demand. The effectiveness of IR and its machine learning (ML) algorithms depends on the availability of a large amount of training and validation data that have been manually reviewed and annotated. In this study, we investigated how active learning (AL) might improve ML performance in clinical safety DDI IR analysis. We recognized that a direct application of AL would not address several primary challenges in DDI IR from the literature: the vast majority of abstracts in PubMed are negative, the existing positive and negative labeled samples do not represent the general sample distribution, and uncertainty sampling in an AL algorithm may produce biased samples. Therefore, we developed several novel sampling and ML schemes to improve AL performance in DDI IR analysis. In particular, random negative sampling was added as part of AL because it incurs no manual labeling cost. We also used two ML algorithms in the AL process to differentiate random negative samples from manually labeled negative samples, and updated both the training and validation samples during the AL process to avoid or reduce biased sampling. Two supervised ML algorithms, support vector machine (SVM) and logistic regression (LR), were used to investigate the consistency of our proposed AL algorithm. Because the ultimate goal of clinical safety DDI IR is to retrieve all DDI toxicity-relevant abstracts, a recall rate of 0.99 was set in developing the AL methods.
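The sampling step described above, uncertainty sampling combined with random negative sampling, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `select_for_round`, its parameters, and the score dictionary are all hypothetical, and the rule "closest score to the decision threshold" is only one common way to define uncertainty.

```python
import random

def select_for_round(scores, threshold, n_uncertain, n_random_neg, seed=0):
    """Pick unlabeled abstracts for one active-learning round.

    scores: dict mapping abstract id -> classifier score (hypothetical input).
    Returns (to_label, assumed_negative).
    """
    # Uncertainty sampling: abstracts whose score lies closest to the
    # decision threshold are sent for manual labeling.
    by_uncertainty = sorted(scores, key=lambda k: (abs(scores[k] - threshold), k))
    to_label = by_uncertainty[:n_uncertain]

    # Random negative sampling: draw from the remaining pool and treat the
    # draws as negatives without manual review. This assumes positives are
    # rare in the pool, so some label noise is possible.
    remaining = sorted(set(scores) - set(to_label))
    rng = random.Random(seed)
    assumed_negative = rng.sample(remaining, min(n_random_neg, len(remaining)))
    return to_label, assumed_negative


scores = {"a": 0.10, "b": 0.48, "c": 0.55, "d": 0.90, "e": 0.02}
to_label, assumed_negative = select_for_round(scores, 0.5, n_uncertain=2, n_random_neg=2)
```

In this toy pool, the two abstracts with scores nearest 0.5 go to manual review, while two others are drawn at random and added as negatives at no labeling cost.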
When we used our newly proposed AL method with SVM, the precision in differentiating the positive samples from manually labeled negative samples improved from 0.45 in the first round to 0.83 in the second round, and the precision in differentiating the positive samples from random negative samples improved from 0.70 in the first round to 0.82 in the second round. When our proposed AL method was used with LR, the improvements in precision followed a similar trend. However, the other AL algorithms tested did not show improved precision, largely because of biased samples caused by uncertainty sampling or by differences between the training and validation data sets.
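The precision figures above are reported at a fixed recall of 0.99. The abstract does not say how the operating point was chosen, so the sketch below makes an assumption: the decision threshold is set to the highest classifier score that still retrieves the required fraction of positives, and precision is then measured at that threshold. The function names and toy data are hypothetical.

```python
import math

def threshold_for_recall(scores, labels, target_recall=0.99):
    """Return the highest score threshold whose recall meets the target."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not pos:
        raise ValueError("need at least one positive example")
    # Number of positives that must be retrieved to reach the target recall.
    needed = max(1, math.ceil(target_recall * len(pos)))
    return pos[needed - 1]

def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 1]
t = threshold_for_recall(scores, labels, target_recall=0.99)
precision, recall = precision_recall_at(scores, labels, t)
```

With only four positives in this toy set, a 0.99 recall target forces the threshold down to the lowest-scoring positive, which illustrates why precision is the quantity that varies between rounds once recall is pinned.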