Toward better QSAR/QSPR modeling: simultaneous outlier detection and variable selection using distribution of model features.

Affiliation

Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha, 410083, People's Republic of China.

Publication information

J Comput Aided Mol Des. 2011 Jan;25(1):67-80. doi: 10.1007/s10822-010-9401-1. Epub 2010 Nov 13.

Abstract

Building a robust and reliable QSAR/QSPR model requires careful attention to two tasks: selecting the optimal variable subset from a large pool of molecular descriptors and detecting outliers in a pool of samples. The two problems are similar and, to some extent, complementary. For a given learning algorithm and data set, one should therefore consider how variable selection and outlier detection interact. In this paper, we describe a consistent methodology for performing variable subset selection and outlier detection simultaneously, based on statistical distributions that can be simulated by building many cross-predictive linear models. The approach exploits the fact that the distribution of linear model coefficients provides a mechanism for ranking and interpreting the effects of variables, while the distribution of prediction errors provides a mechanism for distinguishing outliers from normal samples. Statistics of these distributions, namely the mean and standard deviation, provide a feasible way to describe the information contained in the original samples. Several examples demonstrate the predictive ability of the proposed approach by comparing different approaches as well as their combinations.
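The idea of simulating the two distributions with many cross-predictive linear models can be illustrated with a minimal sketch. The abstract does not give the resampling scheme, the number of models, or the ranking criteria, so everything below is an assumption made for illustration: random train/test splits, ordinary least squares, the function name `model_feature_distributions`, its parameters, and the ratio |mean| / standard deviation used to rank variables. It is not the authors' implementation.

```python
import numpy as np

def model_feature_distributions(X, y, n_models=300, train_frac=0.8, seed=0):
    """Fit many linear models on random subsets and summarize two distributions.

    Hypothetical helper: the split scheme and defaults are assumptions,
    not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_train = int(round(train_frac * n))
    coefs = np.empty((n_models, p))
    # errors[m, i] is the prediction error of sample i under model m,
    # or NaN when sample i was used to train model m.
    errors = np.full((n_models, n), np.nan)
    for m in range(n_models):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        A = np.column_stack([np.ones(n_train), X[train]])  # intercept column
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        coefs[m] = beta[1:]
        errors[m, test] = y[test] - (beta[0] + X[test] @ beta[1:])
    return {
        "coef_mean": coefs.mean(axis=0),
        "coef_std": coefs.std(axis=0),
        "err_mean": np.nanmean(errors, axis=0),
        "err_std": np.nanstd(errors, axis=0),
    }

# Synthetic illustration (made-up data): two informative descriptors,
# three noise descriptors, and one deliberately shifted sample.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=60)
y_demo[7] += 5.0  # planted outlier

stats = model_feature_distributions(X_demo, y_demo)
# Assumed ranking score: large |mean| relative to spread suggests a stable effect.
variable_score = np.abs(stats["coef_mean"]) / stats["coef_std"]
# Assumed outlier indicator: a held-out error whose mean is far from zero.
outlier_score = np.abs(stats["err_mean"])
print("variables ranked:", np.argsort(variable_score)[::-1])
print("most suspicious sample:", int(np.argmax(outlier_score)))
```

In this sketch a sample is judged only by models that did not see it during training, which is one plausible reading of "cross-predictive". A sample with a large standard deviation of errors but a mean near zero could also be flagged, depending on the criterion chosen.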
