Department of Computer Science and the Center for Evolutionary Medicine and Informatics (CEMI) of The Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA.
IEEE Trans Pattern Anal Mach Intell. 2011 Jan;33(1):194-200. doi: 10.1109/TPAMI.2010.160.
Canonical Correlation Analysis (CCA) is a well-known technique for finding the correlations between two sets of multidimensional variables. It projects both sets of variables onto a lower-dimensional space in which they are maximally correlated. CCA is commonly applied for supervised dimensionality reduction, in which the two sets of variables are derived from the data and the class labels, respectively. It is well known that CCA can be formulated as a least-squares problem in the binary-class case. However, whether this formulation extends to the more general setting has remained unclear. In this paper, we show that under a mild condition which tends to hold for high-dimensional data, CCA in the multilabel case can be formulated as a least-squares problem. Based on this equivalence relationship, efficient algorithms for solving least-squares problems can be applied to scale CCA to very large data sets. We also propose several CCA extensions, including a sparse CCA formulation based on 1-norm regularization, and we extend the least-squares formulation to partial least squares. In addition, we show that the CCA projection for one set of variables is independent of the regularization on the other set of multidimensional variables, providing new insights into the effect of regularization on CCA. Experiments on benchmark multilabel data sets confirm the established equivalence relationships and demonstrate the effectiveness and efficiency of the proposed CCA extensions.
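The equivalence claimed above can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' implementation. It assumes details the abstract does not state: the "mild condition" is taken to be that the centered data matrix has rank n - 1 (which holds almost surely when there are more features than samples), and the least-squares target is taken to be T = Y (YᵀY)^(-1/2). The variable names, the synthetic data, and the label pattern are all illustrative. Under these assumptions, the columns of the least-squares solution should be eigenvectors, with eigenvalue 1, of the standard CCA operator for the data-side projection.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 50, 3  # samples, features (d > n), labels

# Data matrix, rows are samples; centering leaves rank n - 1 almost surely.
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)

# Deterministic multilabel indicator matrix (one column per label), centered.
idx = np.arange(n)
Y = np.column_stack([idx % 2 == 1, idx % 3 == 0, idx < n // 2]).astype(float)
Y -= Y.mean(axis=0)

# Assumed least-squares target: T = Y (Y^T Y)^{-1/2}.
evals, evecs = np.linalg.eigh(Y.T @ Y)
T = Y @ (evecs @ np.diag(evals ** -0.5) @ evecs.T)

# Minimum-norm least-squares solution of min_W ||X W - T||_F^2.
X_pinv = np.linalg.pinv(X, rcond=1e-10)
W_ls = X_pinv @ T

# CCA operator for the X-side projection:
# (X^T X)^+ X^T Y (Y^T Y)^{-1} Y^T X, using (X^T X)^+ X^T = X^+.
P_Y = Y @ np.linalg.solve(Y.T @ Y, Y.T)
M = X_pinv @ P_Y @ X

fit_residual = np.linalg.norm(X @ W_ls - T)
eig_residual = np.linalg.norm(M @ W_ls - W_ls)
print(f"least-squares residual: {fit_residual:.2e}")
print(f"eigenvector residual (eigenvalue 1): {eig_residual:.2e}")
```

Both residuals should be close to machine precision in this setting. If `d` is made smaller than `n`, the rank assumption fails and the two solutions are no longer expected to coincide exactly.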