Program in Molecular Therapeutics, Fox Chase Cancer Center, Philadelphia, PA 19111.
Center for Computational Biology, University of Kansas, Lawrence, KS 66045.
Proc Natl Acad Sci U S A. 2020 Aug 4;117(31):18477-18488. doi: 10.1073/pnas.2000585117. Epub 2020 Jul 15.
With the recent explosion in the size of libraries available for screening, virtual screening is positioned to assume a more prominent role in early drug discovery's search for active chemical matter. In typical virtual screens, however, only about 12% of the top-scoring compounds actually show activity when tested in biochemical assays. We argue that most scoring functions used for this task have been developed with insufficient attention to the datasets on which they are trained and tested, leading to overly simplistic models and/or overtraining. These problems are compounded in the literature because studies reporting new scoring methods have not validated their models prospectively within the same study. Here, we report a strategy for building a training dataset (D-COID) that aims to generate highly compelling decoy complexes that are individually matched to available active complexes. Using this dataset, we train a general-purpose classifier for virtual screening (vScreenML) that is built on the XGBoost framework. In retrospective benchmarks, our classifier shows outstanding performance relative to other scoring functions. In a prospective context, nearly all candidate inhibitors from a screen against acetylcholinesterase show detectable activity; beyond this, 10 of 23 compounds have IC50 better than 50 μM. Without any medicinal chemistry optimization, the most potent hit has an IC50 of 280 nM, corresponding to a Ki of 173 nM. These results support using the D-COID strategy to train classifiers for other computational biology tasks, and using vScreenML in virtual screening campaigns against other protein targets. Both D-COID and vScreenML are freely distributed to facilitate such efforts.
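The abstract does not say how the 173 nM inhibition constant was obtained from the 280 nM half-maximal inhibitory concentration. A minimal sketch, assuming the standard Cheng-Prusoff relation for a competitive inhibitor, Ki = IC50 / (1 + [S]/Km), shows what substrate-to-Km ratio would be consistent with the two reported values. The function names and the competitive-inhibition assumption are illustrative and are not taken from the paper.

```python
def ki_from_ic50(ic50_nm: float, substrate_over_km: float) -> float:
    """Cheng-Prusoff for a competitive inhibitor: Ki = IC50 / (1 + [S]/Km).

    Assumption: competitive inhibition; the abstract does not state the mechanism.
    """
    if ic50_nm <= 0 or substrate_over_km < 0:
        raise ValueError("IC50 must be positive and [S]/Km non-negative")
    return ic50_nm / (1.0 + substrate_over_km)


def implied_substrate_ratio(ic50_nm: float, ki_nm: float) -> float:
    """Invert Cheng-Prusoff to get the [S]/Km ratio implied by an IC50/Ki pair."""
    if ic50_nm <= 0 or ki_nm <= 0:
        raise ValueError("IC50 and Ki must be positive")
    return ic50_nm / ki_nm - 1.0


# Values reported in the abstract for the most potent hit (in nM).
reported_ic50_nm = 280.0
reported_ki_nm = 173.0

# Hypothetical back-calculation: the assay's actual [S] and Km are not given here.
ratio = implied_substrate_ratio(reported_ic50_nm, reported_ki_nm)
roundtrip_ki_nm = ki_from_ic50(reported_ic50_nm, ratio)
print(f"implied [S]/Km = {ratio:.3f}, round-trip Ki = {roundtrip_ki_nm:.1f} nM")
```

Under this assumption the reported pair implies a substrate concentration somewhat below Km; a different inhibition mechanism would call for a different conversion.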