Ahmed Laeeq, Georgiev Valentin, Capuccini Marco, Toor Salman, Schaal Wesley, Laure Erwin, Spjuth Ola
Department of Computational Science and Technology, Royal Institute of Technology (KTH), Lindstedtsvägen 5, 10044, Stockholm, Sweden.
Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 75124, Uppsala, Sweden.
J Cheminform. 2018 Mar 1;10(1):8. doi: 10.1186/s13321-018-0265-z.
Docking and scoring large libraries of ligands against target proteins form the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or large workstations in a brute-force manner, by docking and scoring all available ligands.
In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands to exclude those predicted as 'low-scoring'. Then another set of ligands is docked, the model is retrained, and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling.
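The iterative loop described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: `dock` is a hypothetical stand-in for a real docking run, a nearest-centroid classifier replaces the SVM, Spark parallelization is omitted, "high-scoring" is taken to mean a score at or above the median of the ligands docked so far, and model efficiency is assumed to be the fraction of single-label prediction sets. The names `cpvs`, `batch`, `epsilon` and `target_efficiency` are likewise invented for the example.

```python
import random

def dock(features, rng):
    # Hypothetical stand-in for a real docking run: returns a noisy score
    # where higher means better (an assumption made for this sketch).
    return features[0] + 0.5 * features[1] + rng.gauss(0, 0.3)

def dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def fit(train):
    # Toy nearest-centroid model standing in for the SVM: one centroid per class.
    model = {}
    for label in (0, 1):
        pts = [x for x, l in train if l == label]
        model[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return model

def nonconformity(model, x, label):
    # Larger when x lies far from its own class centroid and near the other.
    return dist(x, model[label]) - dist(x, model[1 - label])

def prediction_set(model, calib, x, epsilon):
    # Class-conditional inductive conformal prediction: keep every label
    # whose p-value exceeds the significance level epsilon.
    labels = set()
    for label in (0, 1):
        scores = [nonconformity(model, cx, cl) for cx, cl in calib if cl == label]
        a = nonconformity(model, x, label)
        p = (sum(s >= a for s in scores) + 1) / (len(scores) + 1)
        if p > epsilon:
            labels.add(label)
    return labels

def cpvs(ligands, batch=200, epsilon=0.2, target_efficiency=0.8, seed=0):
    rng = random.Random(seed)
    pool = list(range(len(ligands)))
    rng.shuffle(pool)
    docked = {}  # ligand index -> docking score
    sets = {}    # ligand index -> conformal prediction set
    while pool:
        # 1. Dock the next batch and add it to the labelled data.
        for i in pool[:batch]:
            docked[i] = dock(ligands[i], rng)
        pool = pool[batch:]
        # 2. Label docked ligands: 1 = high-scoring (at or above the median), 0 = low.
        cutoff = sorted(docked.values())[len(docked) // 2]
        data = [(ligands[i], int(s >= cutoff)) for i, s in docked.items()]
        rng.shuffle(data)
        split = int(0.7 * len(data))
        train, calib = data[:split], data[split:]
        if len({l for _, l in train}) < 2:
            continue  # both classes are needed to fit; dock another batch first
        model = fit(train)
        # 3. Predict the remainder and drop confidently 'low-scoring' ligands.
        sets = {i: prediction_set(model, calib, ligands[i], epsilon) for i in pool}
        single = sum(len(s) == 1 for s in sets.values())
        efficiency = single / len(pool) if pool else 1.0
        pool = [i for i in pool if sets[i] != {0}]
        if efficiency >= target_efficiency:
            break
    # 4. Final pass: dock only ligands confidently predicted 'high-scoring'.
    for i in pool:
        if sets.get(i) == {1}:
            docked[i] = dock(ligands[i], rng)
    return docked

# Usage on synthetic two-dimensional "descriptors".
demo_rng = random.Random(1)
demo_ligands = [(demo_rng.uniform(-1, 1), demo_rng.uniform(-1, 1)) for _ in range(2000)]
demo_docked = cpvs(demo_ligands)
print(f"docked {len(demo_docked)} of {len(demo_ligands)} ligands")
```

In this sketch, ligands with an empty or two-label prediction set stay in the pool as uncertain, so later batches can be drawn from them once the model has been retrained. Docking only the `{1}` predictions in the final pass is one possible reading of "docked or excluded based on this model".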
We show on four different targets that conformal-prediction-based virtual screening (CPVS) reduces the number of docked molecules by 62.61% while retaining an average accuracy of 94% for the top 30 hits, and achieves a speedup of 3.7. The implementation is available as open source via GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.