Ballester Pedro J, Schreyer Adrian, Blundell Tom L
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton - CB10 1SD, United Kingdom.
J Chem Inf Model. 2014 Mar 24;54(3):944-55. doi: 10.1021/ci500091r. Epub 2014 Feb 20.
Predicting the binding affinities of large sets of diverse molecules against a range of macromolecular targets is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for exploiting and analyzing the outputs of docking, which is in turn an important tool in problems such as structure-based drug design. Classical scoring functions assume a predetermined, theory-inspired functional form for the relationship between the variables that describe an experimentally determined or modeled structure of a protein-ligand complex and its binding affinity. The inherent problem with this approach is the difficulty of explicitly modeling the various contributions of intermolecular interactions to binding affinity. New scoring functions based on machine-learning regression models, which can effectively exploit much larger amounts of experimental data and circumvent the need for a predetermined functional form, have already been shown to outperform a broad range of state-of-the-art scoring functions in a widely used benchmark. Here, we investigate the impact of the chemical description of the complex on the predictive power of the resulting scoring function using a systematic battery of numerical experiments. These experiments produced the most accurate scoring function to date on the benchmark. Strikingly, we also found that a more precise chemical description of the protein-ligand complex does not generally lead to a more accurate prediction of binding affinity. We discuss four factors that may contribute to this result: modeling assumptions, codependence of representation and regression, data restricted to the bound state, and conformational heterogeneity in data.
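To make the notion of a "chemical description of the complex" concrete, the following is a minimal sketch of one possible representation: counts of protein-ligand atom pairs within a distance cutoff, grouped by element pair. The abstract does not specify the features used, so the element lists, the 12 Å cutoff, and the function name `contact_features` are illustrative assumptions only. A regression model could then be trained on such vectors against measured affinities, and a more detailed description could, for example, use finer atom types or several distance bins.

```python
from collections import Counter
from itertools import product
from math import dist

# Hypothetical element sets and cutoff, chosen only for illustration;
# they are not taken from the abstract.
PROTEIN_ELEMENTS = ("C", "N", "O", "S")
LIGAND_ELEMENTS = ("C", "N", "O", "S", "F", "Cl")
CUTOFF = 12.0  # angstroms (assumed)


def contact_features(protein_atoms, ligand_atoms, cutoff=CUTOFF):
    """Return a fixed-length vector of protein-ligand contact counts.

    Each atom is a tuple (element, (x, y, z)). The vector has one entry
    per (protein element, ligand element) pair, ordered protein-major.
    """
    counts = Counter()
    for (p_el, p_xyz), (l_el, l_xyz) in product(protein_atoms, ligand_atoms):
        # Count the pair if the two atoms lie within the cutoff distance.
        if dist(p_xyz, l_xyz) <= cutoff:
            counts[(p_el, l_el)] += 1
    # Element pairs outside the chosen sets are ignored in the output.
    return [counts[(p, l)] for p in PROTEIN_ELEMENTS for l in LIGAND_ELEMENTS]


# Toy complex: three protein atoms and two ligand atoms on a line.
protein = [("C", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0)), ("O", (20.0, 0.0, 0.0))]
ligand = [("C", (1.0, 0.0, 0.0)), ("O", (4.0, 0.0, 0.0))]
features = contact_features(protein, ligand)
```

In this toy case the distant protein oxygen contributes no contacts, so only the entries for the C and N protein atoms are non-zero.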