Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK.
Bioinformatics. 2010 May 1;26(9):1169-75. doi: 10.1093/bioinformatics/btq112. Epub 2010 Mar 17.
Accurately predicting the binding affinities of large sets of diverse protein-ligand complexes is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for analysing the outputs of molecular docking, which in turn is an important technique for drug discovery, chemical biology and structural biology. Each scoring function assumes a predetermined, theory-inspired functional form for the relationship between the variables that characterize the complex and its predicted binding affinity; this functional form also includes parameters fitted to experimental or simulation data. The inherent problem of this rigid approach is that it leads to poor predictive performance for those complexes that do not conform to the modelling assumptions. Moreover, resampling strategies, such as cross-validation or bootstrapping, are still not systematically used to guard against the overfitting of calibration data in parameter estimation for scoring functions.
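As an illustration of the resampling idea mentioned above, the following minimal sketch estimates out-of-sample error by k-fold cross-validation. It is not the procedure used in the paper: the one-descriptor linear "scoring function", the synthetic data and the names `fit_line` and `k_fold_rmse` are all assumptions made for the example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (a toy 'scoring function')."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def k_fold_rmse(xs, ys, k=5, seed=0):
    """RMSE over all held-out predictions from k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all samples
    sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Parameters are fitted on the training part only.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Error is measured only on samples the fit has not seen.
        sq_err += sum((a * xs[i] + b - ys[i]) ** 2 for i in held_out)
    return (sq_err / len(xs)) ** 0.5

# Synthetic data (assumption): one descriptor and a noisy linear "affinity".
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [0.5 * x + 2.0 + rng.gauss(0, 0.3) for x in xs]
cv_rmse = k_fold_rmse(xs, ys)
print(f"5-fold cross-validated RMSE: {cv_rmse:.3f}")
```

Because every prediction is made for a sample excluded from the fit, a cross-validated error well above the error on the calibration data is a sign of overfitting.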
We propose a novel scoring function (RF-Score) that circumvents the need for problematic modelling assumptions via non-parametric machine learning. In particular, Random Forest was used to implicitly capture binding effects that are hard to model explicitly. RF-Score is compared with the state of the art on the demanding PDBbind benchmark. Results show that RF-Score is a very competitive scoring function. Importantly, RF-Score's performance was shown to improve dramatically with training set size and hence the future availability of more high-quality structural and interaction data is expected to lead to improved versions of RF-Score.
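To make the non-parametric idea concrete, here is a heavily simplified sketch of a Random Forest style regressor: each tree is fitted to a bootstrap sample using a random subset of the features, and predictions are averaged. This is not RF-Score itself. The depth-one trees (stumps), the three integer descriptors, the synthetic "affinity" values and the names `fit_stump`, `fit_forest` and `predict` are assumptions chosen to keep the example short.

```python
import random

def fit_stump(X, y, features):
    """Find the single split (feature, threshold) with the lowest squared error."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # a split must put samples on both sides
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, ml, mr)
    if best is None:  # no valid split found: always predict the mean
        m = sum(y) / len(y)
        return (0, float("inf"), m, m)
    return best[1:]

def fit_forest(X, y, n_trees=50, mtry=2, seed=0):
    """Bagging plus random feature subsets, the two ingredients of Random Forest."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample, with replacement
        feats = rng.sample(range(p), mtry)           # random subset of features
        forest.append(fit_stump([X[i] for i in boot], [y[i] for i in boot], feats))
    return forest

def predict(forest, row):
    """Average the predictions of all trees."""
    preds = [ml if row[f] <= t else mr for f, t, ml, mr in forest]
    return sum(preds) / len(preds)

# Synthetic data (assumption): three integer descriptors per complex; the
# "affinity" depends on the first two in a way no functional form is given for.
rng = random.Random(1)
X = [[rng.randint(0, 10) for _ in range(3)] for _ in range(60)]
y = [4.0 + 0.3 * r[0] + (1.5 if r[1] > 5 else 0.0) + rng.gauss(0, 0.2) for r in X]

forest = fit_forest(X, y)
high = predict(forest, [10, 10, 0])
low = predict(forest, [0, 0, 0])
print(f"predicted affinity: high={high:.2f}, low={low:.2f}")
```

No functional form relating descriptors to affinity is specified anywhere in the code; the ensemble infers the relationship from the training data, which is why more training data can be expected to help.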
pedro.ballester@ebi.ac.uk; jbom@st-andrews.ac.uk
Supplementary data are available at Bioinformatics online.