Novartis Institutes for BioMedical Research, Novartis Pharma AG, Forum 1, Novartis Campus, CH-4056 Basel, Switzerland.
J Chem Inf Model. 2010 Nov 22;50(11):1961-9. doi: 10.1021/ci100264e. Epub 2010 Oct 12.
With the emergence of large collections of protein-ligand complexes complemented by binding data, as found in PDBbind or BindingMOAD, new opportunities for parametrizing and evaluating scoring functions have arisen. With such large data collections available, it becomes feasible to fit scoring functions in a QSAR style, i.e., by defining protein-ligand interaction descriptors and analyzing them with modern machine-learning methods. As with any data modeling approach, the model must be validated carefully. Here, we show that the measured performance of a relatively simple scoring function differs greatly (R = 0.77 vs 0.46; R² = 0.59 vs 0.21) depending on whether it is validated against the PDBbind core set or by leave-cluster-out cross-validation. If proteins from the same family are present in both the training and the validation set, standard validation techniques yield overly optimistic estimates of prediction quality.
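The distinction between a standard split and leave-cluster-out cross-validation can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's descriptors or scoring function: each "group" stands for one protein family, and `LeaveOneGroupOut` guarantees that no family appears in both the training and the test fold.

```python
# Minimal sketch of leave-cluster-out cross-validation (synthetic data;
# hypothetical stand-in for protein-ligand interaction descriptors).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

n_clusters, per_cluster, n_desc = 5, 20, 10
X = rng.normal(size=(n_clusters * per_cluster, n_desc))
groups = np.repeat(np.arange(n_clusters), per_cluster)

# Affinities depend on the descriptors plus a cluster-specific offset,
# mimicking family-level information that a random split would leak
# from the training set into the test set.
offsets = rng.normal(scale=2.0, size=n_clusters)
coef = rng.normal(size=n_desc)
y = X @ coef + offsets[groups] + rng.normal(scale=0.5, size=len(groups))

# Each fold holds out one whole cluster (protein family).
preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

r = np.corrcoef(y, preds)[0, 1]
print(f"leave-cluster-out Pearson R = {r:.2f}")
```

Because the family-specific offset is unlearnable for a held-out family, the leave-cluster-out R is lower than what a random split would report, which is the effect the abstract quantifies.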