Liquet Benoit, Moka Sarat, Muller Samuel
School of Mathematical and Physical Sciences, Macquarie University, Sydney, Australia.
Laboratoire de Mathématiques et de leurs Applications, Université de Pau et des Pays de l'Adour, Pau, France.
Biom J. 2025 Feb;67(1):e70015. doi: 10.1002/bimj.70015.
Selecting the best variables is a challenging problem in supervised and unsupervised learning, especially in high-dimensional contexts where the number of variables is usually much larger than the number of observations. In this paper, we focus on two multivariate statistical methods: principal component analysis and partial least squares. Both are popular linear dimension-reduction methods with numerous applications in fields including genomics, biology, environmental science, and engineering. In particular, these approaches build principal components, new variables that are combinations of all the original variables. A main drawback of principal components is that they are difficult to interpret when the number of variables is large. To define principal components from the most relevant variables, we propose to cast the best subset solution path method into the principal component analysis and partial least squares frameworks. We offer a new alternative by exploiting a continuous optimization algorithm for the best subset solution path. Empirical studies show the efficacy of our approach for providing the best subset solution path. The usage of our algorithm is further illustrated through the analysis of two real data sets. The first data set is analyzed using principal component analysis, while the analysis of the second data set is based on the partial least squares framework.
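To make the notion of a best subset principal component concrete, the following minimal sketch finds, by exhaustive search, the k variables whose leading principal component explains the most variance. It is not the continuous optimization algorithm proposed in the paper: the function name `best_subset_pc`, the simulated data, and the brute-force search are illustrative assumptions, and the search is only feasible when the number of variables is small.

```python
import itertools
import numpy as np

def best_subset_pc(X, k):
    """Brute-force best subset for the first principal component.

    Illustrative only: enumerates every k-variable subset, which is
    feasible only for a small number of variables.
    """
    Xc = X - X.mean(axis=0)               # center each variable
    S = Xc.T @ Xc / (X.shape[0] - 1)      # sample covariance matrix
    p = S.shape[0]
    best_var, best_idx, best_vec = -np.inf, None, None
    for idx in itertools.combinations(range(p), k):
        sub = S[np.ix_(idx, idx)]         # covariance restricted to the subset
        vals, vecs = np.linalg.eigh(sub)  # eigenvalues in ascending order
        if vals[-1] > best_var:
            best_var, best_idx, best_vec = vals[-1], idx, vecs[:, -1]
    loadings = np.zeros(p)                # zero loading outside the subset
    loadings[list(best_idx)] = best_vec
    return loadings, best_var

# Simulated data (an assumption for illustration): only variables 0 and 1
# share a strong common signal; the remaining variables are pure noise.
rng = np.random.default_rng(0)
n, p = 50, 6
latent = rng.normal(size=(n, 1))
X = rng.normal(scale=0.3, size=(n, p))
X[:, :2] += 3.0 * latent

loadings, explained = best_subset_pc(X, k=2)
print("selected variables:", np.flatnonzero(loadings))
print("loadings:", np.round(loadings, 3))
```

The resulting loading vector is nonzero only on the selected variables, which is what makes the component easier to interpret than an ordinary principal component that combines all variables. The number of subsets grows combinatorially with the number of variables, which is the cost that a continuous optimization approach is meant to avoid.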