Graff David E, Shakhnovich Eugene I, Coley Connor W
Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA.
Department of Chemical Engineering, MIT, Cambridge, MA, USA.
Chem Sci. 2021 Apr 29;12(22):7866-7881. doi: 10.1039/d0sc06805e.
Structure-based virtual screening is an important tool in early-stage drug discovery that scores the interactions between a target protein and candidate ligands. As virtual libraries continue to grow (in excess of 10^9 molecules), so too do the resources necessary to conduct exhaustive virtual screening campaigns on these libraries. However, Bayesian optimization techniques, previously employed in other scientific discovery problems, can aid in their exploration: a surrogate structure-property relationship model trained on the predicted affinities of a subset of the library can be applied to the remaining library members, allowing the least promising compounds to be excluded from evaluation. In this study, we explore the application of these techniques to computational docking datasets and assess the impact of surrogate model architecture, acquisition function, and acquisition batch size on optimization performance. We observe significant reductions in computational costs; for example, using a directed message-passing neural network we can identify 94.8% or 89.3% of the top-50 000 ligands in a 100M member library after testing only 2.4% of candidate ligands using an upper confidence bound or greedy acquisition strategy, respectively. Such model-guided searches mitigate the increasing computational costs of screening increasingly large virtual libraries and can accelerate high-throughput virtual screening campaigns with applications beyond docking.
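The model-guided search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration: a bootstrap ensemble of ridge regressors stands in for the directed message-passing neural network, a synthetic linear score stands in for docking (with higher scores treated as better, i.e. negated docking scores), and the names `make_toy_library`, `fit_ensemble`, `acquire`, `screen`, and `top_k_recall` are hypothetical. The sketch shows only the loop structure: train a surrogate on the evaluated subset, rank the rest of the library with a greedy or upper confidence bound (UCB) acquisition function, and evaluate the next batch.

```python
import numpy as np


def make_toy_library(n=5000, d=16, noise=0.1, seed=0):
    """Synthetic 'library': random feature vectors with a hidden linear score.

    Assumption: a toy stand-in for molecular features and docking scores.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + noise * rng.normal(size=n)
    return X, y


def fit_ensemble(X, y, n_models=8, ridge=1.0, rng=None):
    """Fit ridge regressors on bootstrap resamples (stand-in surrogate)."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    weights = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        Xb, yb = X[idx], y[idx]
        weights.append(np.linalg.solve(Xb.T @ Xb + ridge * np.eye(d), Xb.T @ yb))
    return np.stack(weights)  # shape (n_models, d)


def predict(weights, X):
    """Return the ensemble mean and standard deviation for each row of X."""
    preds = X @ weights.T  # shape (n_rows, n_models)
    return preds.mean(axis=1), preds.std(axis=1)


def acquire(mean, std, batch_size, strategy="greedy", beta=2.0):
    """Pick positions of the batch_size candidates with the highest utility."""
    if strategy == "greedy":
        utility = mean
    elif strategy == "ucb":
        utility = mean + beta * std
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.argsort(utility)[-batch_size:]


def screen(X, score_fn, init_size, batch_size, n_iters, strategy="greedy", seed=0):
    """Iteratively evaluate the library, guided by the surrogate model."""
    rng = np.random.default_rng(seed)
    n = len(X)
    evaluated = np.zeros(n, dtype=bool)
    scores = np.full(n, np.nan)

    # Start from a random subset, since no model exists yet.
    first = rng.choice(n, size=init_size, replace=False)
    scores[first] = score_fn(first)
    evaluated[first] = True

    for _ in range(n_iters):
        weights = fit_ensemble(X[evaluated], scores[evaluated], rng=rng)
        pool = np.flatnonzero(~evaluated)  # indices not yet evaluated
        mean, std = predict(weights, X[pool])
        chosen = pool[acquire(mean, std, batch_size, strategy)]
        scores[chosen] = score_fn(chosen)  # the expensive step in practice
        evaluated[chosen] = True
    return evaluated, scores


def top_k_recall(evaluated, true_scores, k):
    """Fraction of the true top-k items that were evaluated."""
    top = np.argsort(true_scores)[-k:]
    return evaluated[top].mean()


if __name__ == "__main__":
    X, y = make_toy_library()
    for strategy in ("greedy", "ucb"):
        evaluated, _ = screen(X, lambda idx: y[idx], init_size=100,
                              batch_size=100, n_iters=5, strategy=strategy)
        print(strategy, evaluated.mean(), top_k_recall(evaluated, y, k=50))
```

Because the toy score is linear in the features, the surrogate learns it almost immediately, so this example says nothing about the relative merits of greedy and UCB acquisition on real docking data. Comparing them meaningfully would require the actual docking datasets and model architectures studied in the paper.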