Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA.
J Comput Aided Mol Des. 2021 Nov;35(11):1095-1123. doi: 10.1007/s10822-021-00423-4. Epub 2021 Oct 28.
The advent of computational drug discovery holds the promise of significantly reducing the effort of experimentalists, along with monetary cost. More generally, predicting the binding of small organic molecules to biological macromolecules has far-reaching implications for a range of problems, including metabolomics. However, problems such as predicting the bound structure of a protein-ligand complex along with its affinity have proven to be an enormous challenge. In recent years, machine learning-based methods have proven to be more accurate than older methods, many based on simple linear regression. Nonetheless, there remains room for improvement, as these methods are often trained on a small set of features, with a single functional form for any given physical effect, and often with little mention of the rationale behind choosing one functional form over another. Moreover, it is not entirely clear why one machine learning method is favored over another. In this work, we endeavor to undertake a comprehensive effort towards developing high-accuracy, machine-learned scoring functions, systematically investigating the effects of machine learning method and choice of features, and, when possible, providing insights into the relevant physics using methods that assess feature importance. Here, we show synergism among disparate features, yielding adjusted R² with experimental binding affinities of up to 0.871 on an independent test set and enrichment for native bound structures of up to 0.913. When purely physical terms that model enthalpic and entropic effects are used in the training, we use feature importance assessments to probe the relevant physics and hopefully guide future investigators working on this and other computational chemistry problems.
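The abstract does not define the goodness-of-fit figure quoted above. Assuming it is the conventional adjusted coefficient of determination, which penalizes ordinary R² for the number of features p given n samples, a minimal sketch of the calculation might look like the following. The function name `adjusted_r2` and its arguments are illustrative and are not taken from the paper.

```python
def adjusted_r2(y_true, y_pred, n_features):
    """Conventional adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Assumption: this is the textbook definition; the paper's exact
    formula is not given in the abstract.
    """
    n = len(y_true)
    if n - n_features - 1 <= 0:
        raise ValueError("need more samples than n_features + 1")

    mean_y = sum(y_true) / n
    # Residual and total sums of squares for ordinary R^2.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    # Penalize R^2 for the number of features used by the model.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
```

Because the penalty grows with the number of features, this quantity is a natural choice when comparing models trained on feature sets of different sizes, as the study describes.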