Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 3. Variable selection in classification.

Affiliation

Milano Chemometrics and QSAR Research Group, Department of Environmental Sciences, University of Milano-Bicocca, Piazza della Scienza 1, I-20126 Milano, Italy.

Publication information

Anal Chim Acta. 2010 Jan 11;657(2):116-22. doi: 10.1016/j.aca.2009.10.033.

DOI: 10.1016/j.aca.2009.10.033

PMID: 20005322

Abstract

In multivariate regression and classification problems, variable selection is an important procedure for choosing an optimal subset of variables, with the aim of producing more parsimonious and possibly more predictive models. Variable selection is often necessary when dealing with methodologies that produce thousands of variables, such as Quantitative Structure-Activity Relationships (QSARs) and high-dimensional analytical procedures. In this paper, a novel method of variable selection for classification is introduced. The method exploits the recently proposed Canonical Measure of Correlation between two sets of variables (CMC index). Here the CMC index is calculated for two specific sets of variables: the first comprises the independent variables, and the second is the unfolded class matrix. The CMC values, calculated by considering one variable at a time, can be sorted to rank the variables by their class discrimination capability. Alternatively, the CMC index can be calculated for all possible combinations of variables and the subset with the maximal CMC selected, but this procedure is computationally more demanding, and the selected subset does not always give the best classification performance. The effectiveness of the CMC index in selecting variables with discriminative ability was compared with that of other well-known variable selection strategies, such as Wilks' Lambda, the VIP index based on Partial Least Squares-Discriminant Analysis, and the selection provided by classification trees. Finally, a Forward Selection of variables based on the CMC index was used in conjunction with Linear Discriminant Analysis. This approach was tested on several chemical data sets, and the results obtained were encouraging.
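The two selection schemes described in the abstract (ranking variables one at a time against the unfolded class matrix, and greedy forward selection of a subset) can be illustrated with a minimal sketch. The abstract does not give the CMC formula, so the code below does not compute the CMC index. As a loudly labeled stand-in it uses the sum of squared canonical correlations between the variable block and the centered class matrix, divided by the number of class dimensions. All function names (`unfold_classes`, `stand_in_score`, `rank_variables`, `forward_select`) are hypothetical and chosen only for illustration, and the Linear Discriminant Analysis step is omitted.

```python
import numpy as np


def unfold_classes(y):
    """Unfolded (one-hot) class matrix: one indicator column per class."""
    y = np.asarray(y)
    classes = np.unique(y)
    return (y[:, None] == classes[None, :]).astype(float)


def _orthonormal_basis(A, tol=1e-10):
    """Orthonormal basis for the column space of the column-centered A."""
    A = A - A.mean(axis=0)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, s > tol * s[0]]  # drops null directions (e.g. constant columns)


def stand_in_score(X, C):
    """ASSUMPTION: stand-in for the CMC index, NOT the published formula.

    Sum of squared canonical correlations between X and C, divided by the
    rank of the centered class matrix, so the value lies in [0, 1].
    """
    Qx, Qc = _orthonormal_basis(X), _orthonormal_basis(C)
    if Qx.shape[1] == 0 or Qc.shape[1] == 0:
        return 0.0
    # Singular values of Qx^T Qc are the canonical correlations.
    rho = np.linalg.svd(Qx.T @ Qc, compute_uv=False)
    return float(np.sum(rho ** 2) / Qc.shape[1])


def rank_variables(X, y):
    """Score each variable alone; return (indices sorted best-first, scores)."""
    C = unfold_classes(y)
    scores = np.array([stand_in_score(X[:, [j]], C) for j in range(X.shape[1])])
    return np.argsort(-scores), scores


def forward_select(X, y, n_keep):
    """Greedy forward selection: add the variable that most raises the score."""
    C = unfold_classes(y)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        best = max(remaining,
                   key=lambda j: stand_in_score(X[:, selected + [j]], C))
        selected.append(best)
        remaining.remove(best)
    return selected


# Tiny illustrative data set: column 0 separates the three classes,
# column 1 is unrelated to them.
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
X = np.column_stack([
    [0.0, 0.1, -0.1, 5.0, 5.1, 4.9, 10.0, 10.1, 9.9],
    [1.0, -1.0, 0.5, -0.5, 0.2, 0.9, -0.8, 0.3, -0.6],
])
order, scores = rank_variables(X, y)
subset = forward_select(X, y, n_keep=2)
```

For a single variable the stand-in score is proportional to the between-class share of that variable's variance, so the one-at-a-time ranking orders variables by how well they separate the class means. A subset chosen this way would then be passed to a classifier such as Linear Discriminant Analysis and judged by its classification performance, which, as the abstract notes, is not guaranteed to be best for the subset with the highest score.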


Similar articles

Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 2. Variable reduction.

Anal Chim Acta. 2009 Aug 19;648(1):52-9. doi: 10.1016/j.aca.2009.06.035. Epub 2009 Jun 21.

Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 1. Theory and simple chemometric applications.

Anal Chim Acta. 2009 Aug 19;648(1):45-51. doi: 10.1016/j.aca.2009.06.032. Epub 2009 Jun 21.

Total ranking models by the genetic algorithm variable subset selection (GA-VSS) approach for environmental priority settings.

Anal Bioanal Chem. 2004 Oct;380(3):430-44. doi: 10.1007/s00216-004-2762-3. Epub 2004 Sep 22.

A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data.

Anal Chim Acta. 2014 Jun 4;829:1-8. doi: 10.1016/j.aca.2014.03.039. Epub 2014 Mar 31.

Variable selection in discriminant partial least-squares analysis.

Anal Chem. 1998 Oct 1;70(19):4126-33. doi: 10.1021/ac980506o.

Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by LASSO.

J Chem Inf Model. 2015 Apr 27;55(4):736-46. doi: 10.1021/ci500715e. Epub 2015 Mar 16.

Part 1. Statistical Learning Methods for the Effects of Multiple Air Pollution Constituents.

Res Rep Health Eff Inst. 2015 Jun(183 Pt 1-2):5-50.

Discriminating variable test and selectivity ratio plot: quantitative tools for interpretation and variable (biomarker) selection in complex spectral or chromatographic profiles.

Anal Chem. 2009 Apr 1;81(7):2581-90. doi: 10.1021/ac802514y.

Predictive-property-ranked variable reduction in partial least squares modelling with final complexity adapted models: comparison of properties for ranking.

Anal Chim Acta. 2013 Jan 14;760:34-45. doi: 10.1016/j.aca.2012.11.012. Epub 2012 Nov 16.

Cited by

Unlocking the Potential of Clustering and Classification Approaches: Navigating Supervised and Unsupervised Chemical Similarity.

Environ Health Perspect. 2024 Aug;132(8):85002. doi: 10.1289/EHP14001. Epub 2024 Aug 6.
