Argonne National Laboratory, Data Science and Learning Division, Chicago, Lemont, 60439, USA.
Department of Computer Science, University of Chicago, Chicago, 60637, USA.
Sci Rep. 2023 Feb 6;13(1):2105. doi: 10.1038/s41598-023-28785-9.
Protein-ligand docking is a computational method for identifying drug leads. It can narrow a vast library of compounds down to a tractable size for downstream simulation or experimental testing, and it is widely used in drug discovery. While there has been progress in accelerating the scoring of compounds with artificial intelligence, few works have carried these successes back to the virtual screening community in terms of utility and forward-looking development. We demonstrate the power of high-speed ML models by scoring 1 billion molecules in under a day (50k predictions per GPU second). We showcase a docking workflow that uses surrogate AI-based models as a pre-filter to a standard docking workflow. Our workflow screens a library of compounds ten times faster than the standard technique, with an error rate of less than 0.01% in detecting the best-scoring 0.1% of compounds. Our analysis of the speedup shows that a further order-of-magnitude gain must come from model accuracy rather than computing speed. To drive that next order of magnitude of acceleration, we share a benchmark dataset consisting of 200 million 3D complex structures and 2D structure scores across a consistent set of 13 million "in-stock" molecules over 15 receptors, or binding sites, across the SARS-CoV-2 proteome. We believe this is strong evidence that the community should begin focusing on improving the accuracy of surrogate models, so that massive compound libraries can be screened 100× or even 1000× faster than with current techniques while missing fewer top hits. The technique outlined aims to be a fast drop-in replacement for docking when screening billion-scale molecular libraries.
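The throughput figure and the pre-filter idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the synthetic scores, the 10% keep fraction, and the convention that a lower score is better are all assumptions made for illustration. It computes the GPU time implied by 50k predictions per GPU second, then checks how many of the true top 0.1% of compounds survive a surrogate pre-filter on made-up data.

```python
import random


def gpu_hours(n_molecules, predictions_per_gpu_second):
    """GPU-hours needed to score n_molecules at a given surrogate throughput."""
    return n_molecules / predictions_per_gpu_second / 3600


def prefilter(surrogate_scores, keep_fraction):
    """Indices of the best keep_fraction of the library by surrogate score.

    Assumption: lower score means stronger predicted binding.
    """
    n_keep = max(1, round(len(surrogate_scores) * keep_fraction))
    order = sorted(range(len(surrogate_scores)), key=surrogate_scores.__getitem__)
    return set(order[:n_keep])


def recall_of_top(docking_scores, kept, top_fraction):
    """Fraction of the true top_fraction docking hits that survive the pre-filter."""
    n_top = max(1, round(len(docking_scores) * top_fraction))
    order = sorted(range(len(docking_scores)), key=docking_scores.__getitem__)
    return sum(i in kept for i in order[:n_top]) / n_top


# 1 billion molecules at 50k predictions per GPU second: about 5.6 GPU-hours.
hours = gpu_hours(1e9, 50_000)

# Synthetic stand-ins (hypothetical data, not the published benchmark):
# "docking" plays the role of the expensive reference score, and
# "surrogate" is a noisy approximation of it.
rng = random.Random(0)
docking = [rng.gauss(-7.0, 1.5) for _ in range(20_000)]
surrogate = [d + rng.gauss(0.0, 0.5) for d in docking]

kept = prefilter(surrogate, keep_fraction=0.10)  # only these would be docked
recall = recall_of_top(docking, kept, top_fraction=0.001)

print(f"GPU-hours for 1e9 molecules: {hours:.2f}")
print(f"docked {len(kept)} of {len(docking)}; recall of true top 0.1%: {recall:.3f}")
```

In this sketch only the kept fraction is passed on to full docking, so the docking cost falls roughly in proportion to `keep_fraction`. How small that fraction can be made without losing top hits depends on how well the surrogate ranks compounds, which is the accuracy dependence the abstract points to.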