Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom.
Centre for Medical Image Computing, Department of Computer Science, UCL, London WC1E 6BT, United Kingdom.
J Chem Inf Model. 2018 Sep 24;58(9):2000-2014. doi: 10.1021/acs.jcim.8b00376. Epub 2018 Sep 10.
The versatility of similarity searching and quantitative structure-activity relationships to model the activity of compound sets within given bioactivity ranges (i.e., interpolation) is well established. However, their relative performance in a scenario common in early-stage drug discovery, where many inactive data points but no active data points are available (i.e., extrapolation from the low-activity to the high-activity range), has not yet been thoroughly examined. To this end, we have designed an iterative virtual screening strategy, which was evaluated on 25 diverse bioactivity data sets from ChEMBL. We benchmark the efficiency of random forest (RF), multiple linear regression, ridge regression, similarity searching, and random selection of compounds in identifying a highly active molecule in the test set among a large number of low-potency compounds. We use the number of iterations required to find this active molecule to evaluate the performance of each experimental setup. We show that linear and ridge regression often outperform RF and similarity searching, reducing the number of iterations needed to find an active compound by a factor of 2 or more. Even simple regression methods seem better able to extrapolate to high-bioactivity ranges than RF, which only provides output values in the range covered by the training set. In addition, examination of the scaffold diversity in the data sets used shows that in some cases similarity searching and RF require twice as many iterations as random selection, depending on the chemical space covered in the initial training data. Lastly, we show using bioactivity data for COX-1 and COX-2 that our framework can be extended to multitarget drug discovery, where compounds are selected by concomitantly considering their activity against multiple targets.
Overall, this study provides an approach for iterative screening to discover highly potent compounds when only inactive data are available in the early stages of drug discovery, and it identifies the best experimental setup for doing so.
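The iterative loop and the extrapolation limit described above can be illustrated with a toy sketch. This is not the authors' implementation: the single numeric descriptor, the noise-free linear activity values, the function names, and the threshold are all assumptions made for illustration. A k-nearest-neighbour average stands in for RF here, since both predict an average of training labels and therefore cannot return a value above the highest activity already seen.

```python
import random
from statistics import mean

def fit_linear(xs, ys):
    """Least-squares line on one descriptor; returns a predict function."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def fit_neighbour_average(xs, ys, k=3):
    """Stand-in for RF (assumption): mean activity of the k nearest
    training compounds, so predictions never exceed max(ys)."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict

def fit_random(xs, ys, rng=random.Random(1)):
    """Random-selection baseline: scores carry no information."""
    return lambda x: rng.random()

def iterations_to_hit(fit, train, pool, threshold):
    """Refit on the training set, 'test' the top-ranked pool compound,
    add it to the training set, and repeat until a compound with
    activity >= threshold is found. Returns the iteration count."""
    train, pool = list(train), list(pool)
    for iteration in range(1, len(pool) + 1):
        model = fit([x for x, _ in train], [y for _, y in train])
        pick = max(pool, key=lambda c: model(c[0]))
        pool.remove(pick)
        train.append(pick)
        if pick[1] >= threshold:
            return iteration
    return None

def activity(x):
    # Hypothetical noise-free pIC50 as a function of one descriptor.
    return 4.0 + 0.25 * x

# Training set: low-activity compounds only (pIC50 4.0 to 5.75).
train = [(x, activity(x)) for x in range(8)]
# Screening pool: exactly one compound (x = 23) exceeds the threshold.
pool = [(x, activity(x)) for x in range(8, 24)]
random.Random(0).shuffle(pool)
threshold = 9.6

for name, fit in [("linear", fit_linear),
                  ("neighbour average", fit_neighbour_average),
                  ("random", fit_random)]:
    print(name, iterations_to_hit(fit, train, pool, threshold))
```

On this idealised data the linear model ranks the most active compound first because it extrapolates the trend beyond the training range, whereas the averaging model has to approach the high-activity region one tested compound at a time. Real bioactivity data are noisy and high-dimensional, so the size of the gap seen here should not be read as a result.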