Golbraikh Alexander, Tropsha Alexander
The Laboratory for Molecular Modeling, School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599-7360, USA.
J Comput Aided Mol Des. 2002 May-Jun;16(5-6):357-69. doi: 10.1023/a:1020869118689.
One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power. The latter can be defined as the ability of a model to predict accurately the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414-425). For rational division of a dataset into training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
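The abstract does not spell out the three sphere-exclusion variants, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes Euclidean distance in descriptor space, a user-chosen sphere radius, and the simplest possible rule for picking the next sphere center (the first compound not yet assigned). The names `sphere_exclusion_split`, `max_nearest_train_distance`, and `radius` are hypothetical and introduced here for illustration.

```python
import math


def sphere_exclusion_split(points, radius):
    """Split compounds into training and test indices (illustrative sketch).

    points: list of equal-length coordinate tuples (descriptor vectors).
    radius: sphere radius in descriptor space (assumed, user-chosen).
    """
    remaining = list(range(len(points)))
    train, test = [], []
    while remaining:
        # Assumption: the next sphere center is the first unassigned compound.
        center = remaining.pop(0)
        train.append(center)
        outside = []
        for idx in remaining:
            # Compounds inside the sphere around the center go to the test set.
            if math.dist(points[center], points[idx]) <= radius:
                test.append(idx)
            else:
                outside.append(idx)
        remaining = outside
    return train, test


def max_nearest_train_distance(points, train, test):
    """Largest distance from any test compound to its nearest training compound."""
    return max(
        (min(math.dist(points[t], points[s]) for s in train) for t in test),
        default=0.0,
    )


# Toy 2-D "descriptor space" with two tight pairs and one isolated compound.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
train, test = sphere_exclusion_split(points, radius=0.5)
# train == [0, 2, 4], test == [1, 3]
```

By construction, every test compound in this sketch lies within `radius` of some training compound, which corresponds to criterion (i). Criterion (ii) is not guaranteed: the isolated compound at (5.0, 5.0) ends up in the training set with no test compound nearby, which is one reason the choice of radius and of the next center matters.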