Toward better QSAR/QSPR modeling: simultaneous outlier detection and variable selection using distribution of model features.

Affiliation

Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha, 410083, People's Republic of China.

Publication information

J Comput Aided Mol Des. 2011 Jan;25(1):67-80. doi: 10.1007/s10822-010-9401-1. Epub 2010 Nov 13.

DOI:10.1007/s10822-010-9401-1

PMID:21076934

Abstract

Building a robust and reliable QSAR/QSPR model requires attention to two tasks: selecting the optimal variable subset from a large pool of molecular descriptors, and detecting outliers in a pool of samples. The two problems are, to some extent, similar and complementary. Given a particular learning algorithm and a particular data set, one should consider how variable selection and outlier detection interact. In this paper, we describe a consistent methodology for performing variable subset selection and outlier detection simultaneously, based on statistical distributions that can be simulated by building many cross-predictive linear models. The approach exploits the fact that the distribution of linear model coefficients provides a mechanism for ranking and interpreting the effects of variables, while the distribution of prediction errors provides a mechanism for distinguishing outliers from normal samples. Statistics of these distributions, namely the mean and the standard deviation, provide a feasible way to describe the information contained in the original samples. Several examples demonstrate the predictive ability of the proposed approach by comparing it with other approaches and with combinations of them.
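The idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the resampling scheme, the regression method, or the decision rules, so the random 70/30 splits, ordinary least squares, the `|mean| / std` reliability score, and the synthetic data below are all assumptions chosen for illustration, and the function name `model_feature_distributions` is hypothetical.

```python
import numpy as np

def model_feature_distributions(X, y, n_models=500, train_fraction=0.7, seed=0):
    """Fit many linear models on random subsamples and summarize the
    distributions of coefficients (per variable) and of prediction
    errors (per sample). Illustrative assumptions: random splits and
    ordinary least squares."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    n_train = int(round(train_fraction * n_samples))
    coefs = np.empty((n_models, n_features))
    # NaN marks "sample was in the training set for this model".
    errors = np.full((n_models, n_samples), np.nan)
    for m in range(n_models):
        perm = rng.permutation(n_samples)
        train, test = perm[:n_train], perm[n_train:]
        # Least-squares fit with an explicit intercept column.
        A = np.column_stack([np.ones(n_train), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        coefs[m] = beta[1:]
        # Errors are recorded only for the held-out samples.
        errors[m, test] = y[test] - (beta[0] + X[test] @ beta[1:])
    return (coefs.mean(axis=0), coefs.std(axis=0, ddof=1),
            np.nanmean(errors, axis=0), np.nanstd(errors, axis=0, ddof=1))

# Synthetic data (assumed): only variables 0 and 1 matter; sample 5 is an outlier.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=60)
y[5] += 10.0

coef_mean, coef_std, err_mean, err_std = model_feature_distributions(X, y)
reliability = np.abs(coef_mean) / coef_std       # large = stable, informative variable
top_variables = np.argsort(-reliability)[:2].tolist()
suspect_sample = int(np.argmax(np.abs(err_mean)))  # large mean error = outlier candidate
```

In this sketch, a variable whose coefficient is large and stable across models ranks high, while a sample whose held-out prediction error has a large mean (or a large spread) is flagged as a candidate outlier. Thresholds for either decision would have to be chosen for the data at hand.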
