Whitley D C, Ford M G, Livingstone D J
Centre for Molecular Design, Institute of Biomedical and Biomolecular Science, University of Portsmouth, UK.
J Chem Inf Comput Sci. 2000 Sep-Oct;40(5):1160-8. doi: 10.1021/ci000384c.
An unsupervised learning method for variable selection is proposed, and its performance is assessed using three typical QSAR data sets. The procedure aims to generate, from any given data set, a subset of descriptors in which the retained variables are relevant, redundancy is eliminated, and multicollinearity is reduced. Continuum regression, an algorithm encompassing ordinary least squares regression, regression on principal components, and partial least squares regression, was used to construct models from the selected variables. The variable selection routine is shown to produce simple, robust, and easily interpreted models for the chosen data sets.
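The abstract does not spell out the selection procedure, so the following is only a minimal sketch of what unsupervised descriptor filtering can look like, not the authors' algorithm. It assumes a generic greedy pairwise-correlation filter: near-constant descriptors are dropped, and a descriptor is kept only if its absolute correlation with every descriptor already kept is below a threshold. The function name `correlation_filter`, the thresholds `r_max` and `var_min`, and the toy data are all hypothetical. No response variable is used, which is what makes the filter unsupervised.

```python
import numpy as np

def correlation_filter(X, r_max=0.9, var_min=1e-8):
    """Greedy pairwise-correlation filter (illustrative, not the paper's method).

    Returns the indices of the columns of X that are kept.
    r_max and var_min are assumed thresholds chosen for this example.
    """
    X = np.asarray(X, dtype=float)

    # Step 1: discard near-constant descriptors, which carry no information
    # and would make the correlation coefficients undefined.
    variances = X.var(axis=0)
    candidates = [j for j in range(X.shape[1]) if variances[j] > var_min]
    if not candidates:
        return []

    # Step 2: absolute correlation matrix of the remaining descriptors.
    # atleast_2d covers the case where only one candidate column is left.
    corr = np.abs(np.atleast_2d(np.corrcoef(X[:, candidates], rowvar=False)))

    # Step 3: keep a descriptor only if it is not strongly correlated
    # with any descriptor that has already been kept.
    kept = []  # pairs of (position in candidates, original column index)
    for pos, col in enumerate(candidates):
        if all(corr[pos, prev_pos] < r_max for prev_pos, _ in kept):
            kept.append((pos, col))
    return [col for _, col in kept]

# Toy data (hypothetical): column 1 nearly duplicates column 0,
# column 2 is independent, column 3 is constant.
rng = np.random.default_rng(0)
x0 = rng.normal(size=50)
x2 = rng.normal(size=50)
X = np.column_stack([x0, 2.0 * x0 + 0.01 * rng.normal(size=50), x2, np.ones(50)])

kept = correlation_filter(X)
print(kept)
```

A pairwise filter of this kind removes redundant descriptor pairs but does not detect multicollinearity that involves three or more descriptors jointly. Handling that would require a criterion based on multiple correlation with the set already selected, which this sketch leaves out.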