Efficient iterative virtual screening with Apache Spark and conformal prediction.

Author information

Ahmed Laeeq, Georgiev Valentin, Capuccini Marco, Toor Salman, Schaal Wesley, Laure Erwin, Spjuth Ola

Affiliations

Department of Computational Science and Technology, Royal Institute of Technology (KTH), Lindstedtsvägen 5, 10044, Stockholm, Sweden.

Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 75124, Uppsala, Sweden.

Publication information

J Cheminform. 2018 Mar 1;10(1):8. doi: 10.1186/s13321-018-0265-z.

DOI:10.1186/s13321-018-0265-z

PMID:29492726

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5833896/

Abstract

BACKGROUND

Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and the calculations are generally carried out on computer clusters or large workstations in a brute-force manner, by docking and scoring every available ligand.
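Because each ligand is docked and scored independently of all the others, a brute-force screen maps onto a plain parallel map. The following is a minimal sketch of that idea, using Python's standard-library thread pool as a stand-in for a cluster framework. `dock_and_score`, `brute_force_screen`, and the example ligand strings are hypothetical names introduced for illustration; the placeholder returns a dummy score and does not call any real docking program.

```python
from concurrent.futures import ThreadPoolExecutor


def dock_and_score(ligand):
    """Hypothetical placeholder for one docking run.

    A real implementation would invoke a docking program here; this toy
    version derives a dummy score from the ligand string so the example runs.
    """
    return ligand, float(len(ligand))


def brute_force_screen(ligands, workers=4):
    """Dock every ligand; no task depends on any other, so a parallel map suffices."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(dock_and_score, ligands))


library = ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"]  # illustrative SMILES strings
results = brute_force_screen(library)
best = max(results, key=lambda pair: pair[1])  # highest dummy score
```

The cost of this approach grows linearly with the library size, since every ligand is docked regardless of how promising it is.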

CONTRIBUTION

In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and using the model to predict the remaining ligands so that those predicted as 'low-scoring' can be excluded. Then another set of ligands is docked, the model is retrained, and the process is repeated until a certain level of model efficiency is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use support vector machines (SVM) and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling.
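The iterative strategy can be sketched in a few dozen lines. This is a toy illustration under stated assumptions, not the authors' implementation: a class-centroid model on a single descriptor `x` stands in for the SVM, `dock` is a hypothetical scoring function (higher is better), and the cutoff of 0.8, the significance level `eps`, the batch size, and the definition of efficiency as the share of single-label prediction sets are all assumed values chosen for the example. The conformal step uses per-class (Mondrian) calibration on a held-out half of the docked ligands.

```python
import random

LABELS = ("high", "low")


def make_library(n, seed=1):
    """Toy library: one descriptor x per ligand plus hidden noise in its score."""
    rng = random.Random(seed)
    return [{"id": i, "x": rng.random(), "noise": rng.gauss(0.0, 0.02)} for i in range(n)]


def dock(ligand):
    """Hypothetical stand-in for a docking run (higher score = better binder)."""
    return ligand["x"] + ligand["noise"]


def to_label(score, cutoff=0.8):
    """Assumed labelling rule: scores at or above the cutoff count as high-scoring."""
    return "high" if score >= cutoff else "low"


def fit(train):
    """Toy ligand-based model (class centroids), standing in for the SVM."""
    model = {}
    for y in LABELS:
        xs = [x for x, lab in train if lab == y]
        if not xs:
            return None  # a class is missing; more docking is needed first
        model[y] = sum(xs) / len(xs)
    return model


def nonconformity(model, x, y):
    """Larger when x lies nearer the other class centroid than the centroid of y."""
    other = "low" if y == "high" else "high"
    return abs(x - model[y]) - abs(x - model[other])


def calibrate(model, calib):
    """Per-class (Mondrian) calibration scores from held-out docked ligands."""
    return {y: [nonconformity(model, x, y) for x, lab in calib if lab == y] for y in LABELS}


def prediction_set(model, cal_scores, x, eps):
    """Labels whose conformal p-value exceeds the significance level eps."""
    labels = set()
    for y in LABELS:
        a = nonconformity(model, x, y)
        p = (sum(s >= a for s in cal_scores[y]) + 1) / (len(cal_scores[y]) + 1)
        if p > eps:
            labels.add(y)
    return labels


def cpvs(ligands, batch=200, eps=0.2, target_efficiency=0.8, seed=0):
    """Iteratively dock, retrain, and discard ligands predicted as {'low'} only."""
    pool = list(ligands)
    random.Random(seed).shuffle(pool)
    docked = []
    while pool:
        chunk, pool = pool[:batch], pool[batch:]
        docked += [(lig, dock(lig)) for lig in chunk]
        data = [(lig["x"], to_label(score)) for lig, score in docked]
        half = len(data) // 2
        model = fit(data[:half])  # proper training set
        if model is None:
            continue
        cal_scores = calibrate(model, data[half:])  # calibration set
        sets = [prediction_set(model, cal_scores, lig["x"], eps) for lig in pool]
        pool = [lig for lig, s in zip(pool, sets) if s != {"low"}]
        # Efficiency here = share of single-label prediction sets (an assumption).
        if sets and sum(len(s) == 1 for s in sets) / len(sets) >= target_efficiency:
            break
    docked += [(lig, dock(lig)) for lig in pool]  # dock whatever was not excluded
    return docked


library = make_library(2000)
docked = cpvs(library)
print(f"docked {len(docked)} of {len(library)} ligands")
```

In this sketch a ligand is dropped only when its prediction set is exactly `{"low"}`, so ambiguous or empty predictions stay in the pool and are eventually docked. That is a conservative design choice made for the example.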

RESULTS

We show on four different targets that conformal-prediction-based virtual screening (CPVS) reduces the number of docked molecules by 62.61% while retaining an accuracy of 94% for the top 30 hits on average and a speedup of 3.7. The implementation is available as open source on GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.
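The two headline metrics can be made concrete with a small worked example. The abstract does not define them formally, so the definitions below are one plausible reading and should be treated as assumptions: reduction is the fraction of the library that was never docked, and top-k accuracy is the share of the k best ligands from exhaustive docking that also rank among the k best of the reduced run. The function names and the five-ligand data are hypothetical.

```python
def docking_reduction(n_total, n_docked):
    """Fraction of the library that never had to be docked."""
    return 1.0 - n_docked / n_total


def top_k_accuracy(full_scores, reduced_scores, k=30, higher_is_better=True):
    """Share of the k best ligands from exhaustive docking that are also
    among the k best ligands of the reduced (CPVS-style) run."""
    def top(scores):
        ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
        return set(ranked[:k])
    return len(top(full_scores) & top(reduced_scores)) / k


# Hypothetical scores: every ligand docked vs. only a subset docked.
full = {"a": 9.0, "b": 8.0, "c": 7.0, "d": 6.0, "e": 5.0}
reduced = {"a": 9.0, "c": 7.0, "d": 6.0}

reduction = docking_reduction(len(full), len(reduced))  # 2 of 5 ligands skipped
accuracy = top_k_accuracy(full, reduced, k=2)           # only "a" is in both top-2 sets

# Applied to a hypothetical library of one million ligands, a 62.61%
# reduction leaves this many docking runs:
remaining = round(1_000_000 * (1 - 0.6261))
```

Set `higher_is_better=False` for docking programs that report more favorable poses with lower scores.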

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e16e/5833896/f1bf185436a3/13321_2018_265_Fig1_HTML.jpg
