Baskin Igor I, Kireeva Natalia, Varnek Alexandre
Department of Chemistry, Moscow State University, Moscow 119991, Russia.
Laboratoire d'Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4, rue B. Pascal, Strasbourg 67000, France.
Mol Inform. 2010 Sep 17;29(8-9):581-7. doi: 10.1002/minf.201000063. Epub 2010 Aug 30.
In this paper, we associate the applicability domain (AD) of QSAR/QSPR models with the region of the input (descriptor) space in which the density of training data points exceeds a certain threshold. It can be proved that the predictive performance of models built on the training set is higher for test compounds inside the high-density region than for those outside it. Instead of searching for a decision surface that separates high- and low-density regions in the input space, the one-class classification 1-SVM approach looks for a hyperplane in the associated feature space. Unlike other AD definitions reported in the literature, this approach: (i) is purely "data-based", i.e. it assigns the same AD to all models built on the same training set, (ii) provides results that depend only on the initial pool of descriptors generated for the training set, (iii) can be used with a very large number of descriptors, as well as within structured kernel-based approaches, e.g., chemical graph kernels. The developed approach has been applied to improve the performance of QSPR models for the stability constants of complexes of organic ligands with alkaline-earth metals in water.
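The idea of a data-based AD can be illustrated with a minimal sketch, assuming scikit-learn's `OneClassSVM` as the 1-SVM implementation and a synthetic Gaussian matrix in place of real molecular descriptors. The kernel choice, the values of `nu` and `gamma`, and the helper name `in_domain` are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for a descriptor matrix (rows: compounds, columns: descriptors).
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 5))

# nu bounds the fraction of training points allowed to fall outside the domain;
# gamma sets the RBF kernel width. Both values are assumptions for illustration.
ad_model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(X_train)

def in_domain(X):
    """Hypothetical helper: boolean mask, True where a compound lies inside the AD."""
    return ad_model.predict(X) == 1

X_near = np.zeros((1, 5))       # at the centre of the training cloud
X_far = np.full((1, 5), 10.0)   # far from every training point

near_inside = bool(in_domain(X_near)[0])
far_inside = bool(in_domain(X_far)[0])
train_inside_fraction = float(in_domain(X_train).mean())
```

Because the AD model is fitted on descriptors alone, the same mask could in principle be used to filter the predictions of any QSPR model trained on the same set, which is the "data-based" property listed in point (i).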