Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection.

Author information

Golbraikh Alexander, Tropsha Alexander

Affiliation

The Laboratory for Molecular Modeling, School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599-7360, USA.

Publication information

J Comput Aided Mol Des. 2002 May-Jun;16(5-6):357-69. doi: 10.1023/a:1020869118689.

DOI:10.1023/a:1020869118689

PMID:12489684

Abstract

One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power. The latter can be defined as the ability of a model to predict accurately the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into the training and test sets, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For quantitative description of these criteria, we use molecular dataset diversity indices introduced recently (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414-425). For rational division of a dataset into the training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
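The abstract does not spell out the three sphere-exclusion variants, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes Euclidean distance in descriptor space, a user-chosen `radius`, and a random visiting order; the function names `sphere_exclusion_split` and `max_nearest_distance` are hypothetical. Each visited point becomes a training compound, and every remaining point inside its sphere is moved to the test set, so each test point lies within `radius` of some training point (criterion i) and training points are more than `radius` apart (criterion iii). Criterion (ii) is not guaranteed by this simple pass, since an isolated training point may have no test neighbor.

```python
import math
import random
from typing import List, Sequence, Tuple


def sphere_exclusion_split(
    points: Sequence[Sequence[float]], radius: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split point indices into (train, test) with one sphere-exclusion pass."""
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    rng.shuffle(remaining)  # assumption: sphere centers are visited in random order
    train: List[int] = []
    test: List[int] = []
    while remaining:
        center = remaining.pop(0)
        train.append(center)  # the sphere center joins the training set
        outside = []
        for i in remaining:
            if math.dist(points[center], points[i]) <= radius:
                test.append(i)  # inside the sphere: assigned to the test set
            else:
                outside.append(i)  # still unassigned
        remaining = outside
    return train, test


def max_nearest_distance(
    points: Sequence[Sequence[float]], from_idx: List[int], to_idx: List[int]
) -> float:
    """Largest distance from any point in from_idx to its nearest point in to_idx."""
    return max(
        min(math.dist(points[i], points[j]) for j in to_idx) for i in from_idx
    )


# Toy 2-D "descriptor space": two tight clusters and one isolated point.
descriptors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0)]
train, test = sphere_exclusion_split(descriptors, radius=0.5)
print("train:", sorted(train), "test:", sorted(test))
print("max test-to-train distance:", max_nearest_distance(descriptors, test, train))
```

In this toy example the isolated point always ends up in the training set with no nearby test compound, which illustrates why the closeness of training points to test points has to be checked separately. A smaller `radius` yields more training compounds and fewer test compounds; choosing it is left to the user in this sketch.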

Similar articles

Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection.

J Comput Aided Mol Des. 2002 May-Jun;16(5-6):357-69. doi: 10.1023/a:1020869118689.

Mol Divers. 2002;5(4):231-43. doi: 10.1023/a:1021372108686.

Combinatorial QSAR of ambergris fragrance compounds.

J Chem Inf Comput Sci. 2004 Mar-Apr;44(2):582-95. doi: 10.1021/ci034203t.

Does rational selection of training and test sets improve the outcome of QSAR modeling?

J Chem Inf Model. 2012 Oct 22;52(10):2570-8. doi: 10.1021/ci300338w. Epub 2012 Oct 3.

Rational selection of training and test sets for the development of validated QSAR models.

J Comput Aided Mol Des. 2003 Feb-Apr;17(2-4):241-53. doi: 10.1023/a:1025386326946.

QSAR modeling using chirality descriptors derived from molecular topology.

J Chem Inf Comput Sci. 2003 Jan-Feb;43(1):144-54. doi: 10.1021/ci025516b.

Application of validated QSAR models of D1 dopaminergic antagonists for database mining.

J Med Chem. 2005 Nov 17;48(23):7322-32. doi: 10.1021/jm049116m.

Combinatorial QSAR modeling of P-glycoprotein substrates.

J Chem Inf Model. 2006 May-Jun;46(3):1245-54. doi: 10.1021/ci0504317.

Impact assessment of the rational selection of training and test sets on the predictive ability of QSAR models.

SAR QSAR Environ Res. 2017 Dec;28(12):1011-1023. doi: 10.1080/1062936X.2017.1397056. Epub 2017 Nov 14.

Evaluation of QSAR Equations for Virtual Screening.

Int J Mol Sci. 2020 Oct 22;21(21):7828. doi: 10.3390/ijms21217828.

Cited by

Multitarget Design of Steroidal Inhibitors Against Hormone-Dependent Breast Cancer: An Integrated In Silico Approach.

Int J Mol Sci. 2025 Aug 2;26(15):7477. doi: 10.3390/ijms26157477.

A Novel Empirical Fractional Approach for Modeling the Clogging of Membrane Filtration During Protein Microfiltration.

Membranes (Basel). 2025 Mar 26;15(4):99. doi: 10.3390/membranes15040099.

Nanoparticle Skin Penetration: Depths and Routes Modeled In-Silico.

Small. 2025 May;21(20):e2412541. doi: 10.1002/smll.202412541. Epub 2025 Mar 27.

Application of artificial intelligence to quantitative structure-retention relationship calculations in chromatography.

J Pharm Anal. 2025 Jan;15(1):101155. doi: 10.1016/j.jpha.2024.101155. Epub 2024 Nov 26.

Quantitative physics-physiology relationship modeling of human emotional response to Shu music.

Front Psychol. 2024 Oct 8;15:1351058. doi: 10.3389/fpsyg.2024.1351058. eCollection 2024.

QSAR, ADMET, molecular docking, and dynamics studies of 1,2,4-triazine-3(2H)-one derivatives as tubulin inhibitors for breast cancer therapy.

Sci Rep. 2024 Jul 16;14(1):16418. doi: 10.1038/s41598-024-66877-2.

Design of new molecules against cervical cancer using DFT, theoretical spectroscopy, 2D/3D-QSAR, molecular docking, pharmacophore and ADMET investigations.

Heliyon. 2024 Jan 24;10(3):e24551. doi: 10.1016/j.heliyon.2024.e24551. eCollection 2024 Feb 15.

Clustering of atoms relative to vector space in the Z-matrix coordinate system and 'graphical fingerprint' analysis of 3D pharmacophore structure.

Mol Divers. 2024 Dec;28(6):4087-4104. doi: 10.1007/s11030-023-10798-1. Epub 2024 Jan 28.

Exploration of Structure-Activity Relationship Using Integrated Structure and Ligand Based Approach: Hydroxamic Acid-Based HDAC Inhibitors and Cytotoxic Agents.

Turk J Pharm Sci. 2023 Aug 22;20(4):270-284. doi: 10.4274/tjps.galenos.2022.12269.

Contribution of Reliable Chromatographic Data in QSAR for Modelling Bisphenol Transport across the Human Placenta Barrier.

Molecules. 2023 Jan 4;28(2):500. doi: 10.3390/molecules28020500.

References

Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins.

J Am Chem Soc. 1988 Aug 1;110(18):5959-67. doi: 10.1021/ja00226a005.

Structural determination of paraffin boiling points.

J Am Chem Soc. 1947 Jan;69(1):17-20. doi: 10.1021/ja01193a005.

Beware of q2!

J Mol Graph Model. 2002 Jan;20(4):269-76. doi: 10.1016/s1093-3263(01)00123-1.

QSAR and k-nearest neighbor classification analysis of selective cyclooxygenase-2 inhibitors using topologically-based numerical descriptors.

J Chem Inf Comput Sci. 2001 Nov-Dec;41(6):1553-60. doi: 10.1021/ci010073h.

Toward an optimal procedure for variable selection and QSAR model building.

J Chem Inf Comput Sci. 2001 Sep-Oct;41(5):1218-27. doi: 10.1021/ci010291a.

Quantitative structure-antitumor activity relationships of camptothecin analogues: cluster analysis and genetic algorithm-based studies.

J Med Chem. 2001 Sep 27;44(20):3254-63. doi: 10.1021/jm0005151.

QSAR for boiling points of "small" sulfides. Are the "high-quality structure-property-activity regressions" the real high quality QSAR models?

J Chem Inf Comput Sci. 2001 Jul-Aug;41(4):1022-7. doi: 10.1021/ci0001637.

Adaptive neuro-fuzzy inference system: an instant and architecture-free predictor for improved QSAR studies.

J Med Chem. 2001 Aug 16;44(17):2772-83. doi: 10.1021/jm000226c.

Volume learning algorithm artificial neural networks for 3D QSAR studies.

J Med Chem. 2001 Jul 19;44(15):2411-20. doi: 10.1021/jm010858e.

Classification of environmental estrogens by physicochemical properties using principal component analysis and hierarchical cluster analysis.

J Chem Inf Comput Sci. 2001 May-Jun;41(3):718-26. doi: 10.1021/ci000333f.
