Chemistry Modeling and Informatics, Merck Research Laboratories, Rahway, New Jersey 07065, USA.
J Chem Inf Model. 2012 Mar 26;52(3):814-23. doi: 10.1021/ci300004n. Epub 2012 Mar 9.
One popular metric for estimating the accuracy of prospective quantitative structure-activity relationship (QSAR) predictions is based on the similarity of the compound being predicted to compounds in the training set from which the QSAR model was built. More recent work in the field has indicated that other parameters might be equally or more important than similarity. Here we make use of two additional parameters: the variation of prediction among random forest trees (less variation among trees indicates more accurate prediction) and the prediction itself (certain ranges of activity are intrinsically easier to predict than others). The accuracy of prediction for a QSAR model, as measured by the root-mean-square error, can be estimated by cross-validation on the training set at the time of model-building and stored as a three-dimensional array of bins. This is an obvious extension of the one-dimensional array of bins we previously proposed for similarity to the training set [Sheridan et al. J. Chem. Inf. Comput. Sci. 2004, 44, 1912-1928]. We show that using these three parameters simultaneously adds much more discrimination in prediction accuracy than any single parameter. This approach can be applied to any QSAR method that produces an ensemble of models. We also show that the root-mean-square errors produced by cross-validation are predictive of root-mean-square errors of compounds tested after the model was built.
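The three-dimensional array of bins described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_rmse_bins`, `lookup_rmse`), the argument names, and the bin edges are all hypothetical, and the sketch assumes the cross-validated similarity, tree-to-tree standard deviation, predicted activity, and prediction error are already available as arrays.

```python
import numpy as np

def build_rmse_bins(sim, tree_sd, pred, err, edges):
    """Bin cross-validated errors by (similarity, tree SD, prediction).

    edges is a list of three 1-D arrays of bin edges (an assumption of
    this sketch; the abstract does not specify the bin boundaries).
    Returns the per-bin RMSE (NaN for empty bins) and the per-bin counts.
    """
    coords = np.column_stack([sim, tree_sd, pred])
    # Sum of squared errors per bin, then the number of compounds per bin.
    sq_sum, _ = np.histogramdd(coords, bins=edges, weights=np.asarray(err) ** 2)
    counts, _ = np.histogramdd(coords, bins=edges)
    with np.errstate(invalid="ignore", divide="ignore"):
        rmse = np.sqrt(sq_sum / counts)  # 0/0 gives NaN for empty bins
    return rmse, counts

def lookup_rmse(rmse, edges, sim, tree_sd, pred):
    """Return the stored RMSE for the bin that a new prediction falls in.

    Values outside the edge range are clipped to the nearest bin
    (a choice made for this sketch only).
    """
    idx = []
    for value, e in zip((sim, tree_sd, pred), edges):
        i = np.searchsorted(e, value, side="right") - 1
        idx.append(int(np.clip(i, 0, len(e) - 2)))
    return rmse[tuple(idx)]
```

At prediction time, the three descriptors of a new compound would be passed to `lookup_rmse` to retrieve an error estimate; a bin that returns NaN had no cross-validated compounds, so it offers no estimate.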