Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 3. Variable selection in classification.

Affiliation

Milano Chemometrics and QSAR Research Group, Department of Environmental Sciences, University of Milano-Bicocca, Piazza della Scienza 1, I-20126 Milano, Italy.

Publication information

Anal Chim Acta. 2010 Jan 11;657(2):116-22. doi: 10.1016/j.aca.2009.10.033.

Abstract

In multivariate regression and classification problems, variable selection is an important procedure for choosing an optimal subset of variables, with the aim of producing more parsimonious and possibly more predictive models. Variable selection is often necessary when dealing with methodologies that produce thousands of variables, such as Quantitative Structure-Activity Relationships (QSARs) and high-dimensional analytical procedures. In this paper a novel method for variable selection for classification purposes is introduced. This method exploits the recently proposed Canonical Measure of Correlation between two sets of variables (CMC index). The CMC index is in this case calculated for two specific sets of variables, the former comprising the independent variables and the latter the unfolded class matrix. The CMC values, calculated by considering one variable at a time, can be sorted to rank the variables by their class discrimination capability. Alternatively, the CMC index can be calculated for all possible combinations of variables and the variable subset with the maximal CMC can be selected, but this procedure is computationally more demanding and the classification performance of the selected subset is not always the best. The effectiveness of the CMC index in selecting variables with discriminative ability was compared with that of other well-known strategies for variable selection, such as Wilks' Lambda, the VIP index based on Partial Least Squares-Discriminant Analysis, and the selection provided by classification trees. Finally, a variable Forward Selection based on the CMC index was used in conjunction with Linear Discriminant Analysis. This approach was tested on several chemical data sets. The results obtained were encouraging.
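The ranking and forward-selection workflow described in the abstract can be sketched as follows. The abstract does not give the CMC formula, so this sketch swaps in a stand-in score, the mean squared canonical correlation between a variable subset and the unfolded (one-hot) class matrix; it is an assumption for illustration and not the published CMC index. All function names (`unfold_classes`, `association_score`, `rank_variables`, `forward_select`) are hypothetical.

```python
import numpy as np

def _centered_basis(A, tol=1e-10):
    """Orthonormal basis for the column space of the column-centered matrix A."""
    A = np.asarray(A, dtype=float)
    A = A - A.mean(axis=0)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    if s.size == 0 or s[0] == 0.0:
        return U[:, :0]          # constant columns only: empty basis
    return U[:, s > tol * s[0]]  # drop numerically null directions

def unfold_classes(y):
    """Unfold a class vector into an n x G indicator (one-hot) matrix."""
    classes, idx = np.unique(y, return_inverse=True)
    return np.eye(len(classes))[idx]

def association_score(X, Y):
    """Stand-in for the CMC index (ASSUMPTION, not the paper's definition):
    mean squared canonical correlation between the two variable sets."""
    Ux, Uy = _centered_basis(X), _centered_basis(Y)
    k = min(Ux.shape[1], Uy.shape[1])
    if k == 0:
        return 0.0
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)  # canonical correlations
    return float(np.sum(rho ** 2) / k)

def rank_variables(X, y):
    """Score one variable at a time and sort in decreasing order of score."""
    X = np.asarray(X, dtype=float)
    Y = unfold_classes(y)
    scores = np.array([association_score(X[:, [j]], Y) for j in range(X.shape[1])])
    return np.argsort(-scores), scores

def forward_select(X, y, n_keep):
    """Greedy forward selection: at each step add the variable that
    maximizes the score of the enlarged subset."""
    X = np.asarray(X, dtype=float)
    Y = unfold_classes(y)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        best = max(remaining, key=lambda j: association_score(X[:, selected + [j]], Y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a single variable, this stand-in score reduces to the ratio of between-class to total sum of squares for that variable. The selected columns could then be passed to any Linear Discriminant Analysis implementation; exhaustive search over all subsets is deliberately left out because, as the abstract notes, it is computationally more demanding.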
