Trakya University Faculty of Medicine, Department of Biostatistics and Medical Informatics, Edirne, Turkey.
J Chem Inf Model. 2020 Sep 28;60(9):4180-4190. doi: 10.1021/acs.jcim.9b01162. Epub 2020 Jul 8.
Drug discovery has become an increasingly expensive and time-consuming process. In its early phase, an extensive search is performed to find drug-like compounds, which can then be optimized over time into a marketed drug. One conventional way of detecting active compounds is to perform a high-throughput screening (HTS) experiment. As of July 2019, the PubChem repository contained 1.3 million bioassays generated through HTS experiments. This makes PubChem a valuable resource for applying machine learning algorithms to develop classification models that detect active compounds for drug discovery. However, data sets obtained from PubChem are highly imbalanced, and this imbalance has a negative impact on the classification performance of machine learning algorithms. Here, we explored the classification performance of deep neural networks (DNNs) on imbalanced compound data sets after applying various data balancing methods. We used five confirmatory HTS bioassays from the PubChem repository and applied one undersampling and three oversampling methods as data balancing methods. We used a fully connected DNN model with two hidden layers to classify active and inactive molecules. To evaluate the performance of the network, we calculated six performance metrics: balanced accuracy, precision, recall, F1 score, Matthews correlation coefficient, and area under the ROC curve. The results showed that the effect of imbalanced data on network performance could be mitigated to a degree by applying the data balancing methods. The level of imbalance, however, has a negative effect on the performance of the network.
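To make the balancing and evaluation steps concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not name the specific balancing methods, so random undersampling is used here only as one common choice. The function names, the 1% active rate, and the "predict everything inactive" classifier are all hypothetical. Five of the six metrics are computed from hard 0/1 predictions; area under the ROC curve is left out because it needs continuous scores.

```python
import math
import random


def random_undersample(labels, seed=0):
    """Return sorted indices that keep every minority-class sample and an
    equally sized random subset of the majority class (labels are 0/1).
    Random undersampling is an assumption; the abstract names no method."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))


def classification_metrics(y_true, y_pred):
    """Compute accuracy plus five of the abstract's metrics from 0/1 labels.
    Undefined ratios (zero denominator) are reported as 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical assay: 10 active compounds (1) among 1000 screened.
y_true = [1] * 10 + [0] * 990
# A trivial classifier that labels every compound inactive.
y_pred = [0] * 1000

m = classification_metrics(y_true, y_pred)
kept = random_undersample(y_true)  # 10 actives + 10 sampled inactives
```

On this toy data the trivial classifier reaches an accuracy of 0.99 while its balanced accuracy is 0.5 and its recall and Matthews correlation coefficient are 0, which illustrates why imbalance-aware metrics are reported instead of plain accuracy. The undersampled index list `kept` contains 20 compounds with a 1:1 class ratio, at the cost of discarding most inactive compounds.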