Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia.
Department of Bioinformatics, Medical-Biological Department, Pirogov Russian National Research Medical University, Moscow, Russia.
SAR QSAR Environ Res. 2019 Oct;30(10):759-773. doi: 10.1080/1062936X.2019.1665580. Epub 2019 Sep 24.
Existing data on structures and biological activities are limited and distributed unevenly across distinct molecular targets and chemical compounds. The question arises whether these data represent an unbiased sample of the general population of chemical-biological interactions. To answer this question, we analyzed ChEMBL data for 87,583 molecules tested against 919 protein targets using supervised and unsupervised approaches. Hierarchical clustering of the Murcko frameworks generated using Chemistry Development Toolkit showed that the available data form a large diffuse cloud without apparent structure. In contrast, PASS-based classifiers made it possible to predict whether a compound had been tested against a particular molecular target, regardless of whether it was active or not. Thus, one may conclude that the selection of chemical compounds for testing against specific targets is biased, probably due to the influence of prior knowledge. We assessed the possibility of improving (Q)SAR predictions using this fact: PASS prediction of the interaction with a particular target is significantly more accurate for compounds predicted as tested against that target than for those predicted as untested (average ROC AUC values of about 0.87 and 0.75, respectively). Thus, taking the existing bias in the training set data into account may increase the performance of virtual screening.
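The comparison reported in the abstract (activity-prediction accuracy for compounds predicted as tested versus predicted as untested) can be illustrated with a minimal sketch. It does not use PASS or ChEMBL: the two score lists stand in for the outputs of any "was it tested?" classifier and any activity classifier, and the function names, the 0.5 threshold, and the toy numbers are all assumptions made up for illustration, not values from the study.

```python
from typing import Dict, List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def auc_by_predicted_testing(
    active: Sequence[int],
    activity_score: Sequence[float],
    tested_score: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Dict[str, float]:
    """Split compounds by the 'tested' prediction, then score each group."""
    groups: Dict[str, Tuple[List[int], List[float]]] = {
        "predicted_tested": ([], []),
        "predicted_untested": ([], []),
    }
    for y, a, t in zip(active, activity_score, tested_score):
        key = "predicted_tested" if t >= threshold else "predicted_untested"
        groups[key][0].append(y)
        groups[key][1].append(a)
    return {key: roc_auc(ys, ss) for key, (ys, ss) in groups.items()}


# Hypothetical toy data for one target: eight compounds, invented values.
active = [1, 0, 1, 0, 1, 0, 1, 0]
activity_score = [0.9, 0.2, 0.8, 0.1, 0.6, 0.7, 0.4, 0.3]
tested_score = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1, 0.3, 0.4]

result = auc_by_predicted_testing(active, activity_score, tested_score)
print(result)
```

In a real analysis the per-group AUC would be computed per target and then averaged across targets, which is how the abstract reports its figures; the pairwise AUC above is quadratic in group size and is written for clarity rather than speed.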