Department of Statistics, University of Oxford, Oxford OX1 2JD, United Kingdom.
Diamond Light Source Ltd., Harwell Science and Innovation Campus, Didcot OX11 0DE, United Kingdom.
J Chem Inf Model. 2023 May 22;63(10):2960-2974. doi: 10.1021/acs.jcim.3c00322. Epub 2023 May 11.
Over the past few years, many machine learning-based scoring functions for predicting the binding of small molecules to proteins have been developed. Their objective is to approximate the distribution that takes two molecules as input and outputs the energy of their interaction. Only a scoring function that accounts for the interatomic interactions involved in binding can accurately predict binding affinity on unseen molecules. However, many scoring functions make predictions based on data set biases rather than an understanding of the physics of binding. These scoring functions perform well when tested on targets similar to those in the training set but fail to generalize to dissimilar targets. To test what a machine learning-based scoring function has learned, input attribution, a technique for learning which features are important to a model when making a prediction on a particular data point, can be applied. If a model successfully learns something beyond data set biases, attribution should give insight into the important binding interactions that are taking place. We built a machine learning-based scoring function, PointVS, that aimed to avoid the influence of bias via thorough training and test data set filtering, and show that it achieves performance comparable to other leading methods on the Comparative Assessment of Scoring Functions, 2016 (CASF-2016) benchmark. We then use the CASF-2016 test set to perform attribution and find that the bonds identified as important by PointVS, unlike those extracted from other scoring functions, have a high correlation with those found by a distance-based interaction profiler. We then show that attribution can be used to extract important binding pharmacophores from a given protein target when supplied with a number of bound structures. We use this information to perform fragment elaboration and see improvements in docking scores compared to using structural information from a traditional, data-based approach. This not only provides definitive proof that the scoring function has learned to identify some important binding interactions but also constitutes the first deep learning-based method for extracting structural information from a target for molecule design.
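As an illustration of what input attribution means in this setting, the sketch below applies one simple form of it, masking, to a toy scoring function. Everything in it is an assumption made for illustration: `toy_score` is a hypothetical contact counter standing in for a learned model, the coordinates are invented, and the masking scheme is not taken from the abstract, which does not specify the attribution method used.

```python
import math


def toy_score(ligand, protein, cutoff=4.0):
    """Hypothetical stand-in for a learned scoring function.

    Counts ligand-protein atom pairs closer than `cutoff` angstroms.
    """
    return sum(
        1.0
        for lig_atom in ligand
        for prot_atom in protein
        if math.dist(lig_atom, prot_atom) < cutoff
    )


def masking_attribution(ligand, protein, score_fn=toy_score):
    """Attribute the score to ligand atoms by masking them one at a time.

    The attribution of atom i is the drop in score when atom i is removed.
    """
    full_score = score_fn(ligand, protein)
    attributions = []
    for i in range(len(ligand)):
        masked_ligand = ligand[:i] + ligand[i + 1:]
        attributions.append(full_score - score_fn(masked_ligand, protein))
    return attributions


# Invented coordinates (angstroms), for illustration only.
protein = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ligand = [(1.5, 1.0, 0.0), (10.0, 0.0, 0.0), (0.0, 3.5, 0.0)]

attributions = masking_attribution(ligand, protein)
print(attributions)  # [2.0, 0.0, 1.0]
```

Here the first ligand atom is in contact with both protein atoms and receives the largest attribution, while the distant second atom receives none. With a real scoring function, the same per-atom scores could be compared against the interactions reported by a distance-based profiler, which is the kind of check the abstract describes.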