Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection.

Author information

Golbraikh Alexander, Tropsha Alexander

Affiliation

The Laboratory for Molecular Modeling, School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599-7360, USA.

Publication information

J Comput Aided Mol Des. 2002 May-Jun;16(5-6):357-69. doi: 10.1023/a:1020869118689.

Abstract

One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power, defined as the ability of a model to accurately predict the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that the training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414-425). For rational division of a dataset into training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
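The abstract does not spell out the three sphere-exclusion variants, so the following is only a minimal sketch of the general idea, not the authors' algorithms. It assumes Euclidean distance in descriptor space, a user-chosen `radius`, a hypothetical seed rule (lowest remaining index), and a hypothetical assignment rule (each sphere centre goes to the training set, every other point inside the sphere goes to the test set). The function name `sphere_exclusion_split` and the toy descriptor values are illustrative.

```python
import math


def sphere_exclusion_split(points, radius):
    """Split points into training and test indices by sphere exclusion.

    Illustrative assumptions (not taken from the paper):
    - the next sphere centre is the remaining point with the lowest index;
    - the centre joins the training set;
    - every other remaining point within `radius` joins the test set.
    """
    remaining = list(range(len(points)))
    train, test = [], []
    while remaining:
        centre = remaining.pop(0)  # assumed seed rule
        train.append(centre)
        outside = []
        for i in remaining:
            if math.dist(points[centre], points[i]) <= radius:
                test.append(i)      # inside the sphere: goes to the test set
            else:
                outside.append(i)   # outside: stays available for later spheres
        remaining = outside         # exclude the whole sphere from further use
    return train, test


# Toy 2-D "descriptor space" with two tight clusters and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
train, test = sphere_exclusion_split(points, radius=0.5)
print(train, test)  # [0, 2, 4] [1, 3]
```

Under these assumptions every test point lies within `radius` of some training point, which corresponds to criterion (i), and the training points are more than `radius` apart from one another, which loosely reflects criterion (iii). Criterion (ii) is not guaranteed: the isolated point at index 4 ends up in the training set with no nearby test point.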
