Tenenhaus Arthur, Philippe Cathy, Guillemot Vincent, Le Cao Kim-Anh, Grill Jacques, Frouin Vincent
SUPELEC, Plateau de Moulon, 3 rue Joliot-Curie, 91192 Gif-sur-Yvette Cedex, France
CNRS-IGR-Paris XI University, UMR8203, 94805 Villejuif Cedex, France
Biostatistics. 2014 Jul;15(3):569-83. doi: 10.1093/biostatistics/kxu001. Epub 2014 Feb 17.
Regularized generalized canonical correlation analysis (RGCCA) is a generalization of regularized canonical correlation analysis to 3 or more sets of variables. RGCCA is a component-based approach that aims to study the relationships between several sets of variables. The quality and interpretability of the RGCCA components are likely to be affected by the usefulness and relevance of the variables in each block. It is therefore important to identify, within each block, which subsets of significant variables are active in the relationships between blocks. In this paper, RGCCA is extended to address the issue of variable selection. Specifically, sparse generalized canonical correlation analysis (SGCCA) is proposed to combine RGCCA with an ℓ1-penalty in a unified framework. Within this framework, blocks are not necessarily fully connected, which makes SGCCA a flexible method for analyzing a wide variety of practical problems. Finally, the versatility and usefulness of SGCCA are illustrated on a simulated dataset and on a 3-block dataset that combines gene expression, comparative genomic hybridization, and a qualitative phenotype measured on a set of 53 children with glioma. SGCCA is available on CRAN as part of the RGCCA package.
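As a rough illustration of how an ℓ1-type penalty can produce variable selection in a block weight vector, the sketch below applies soft-thresholding followed by rescaling to unit ℓ2 norm. This is a generic device for ℓ1-penalized problems and is not taken from the RGCCA package; the function names, the example weights, and the threshold `lam` are all hypothetical.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry of `a` toward zero by `lam`.

    Entries with absolute value <= lam become exactly zero, which is
    how an l1-type penalty removes variables from a weight vector.
    """
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def sparse_unit_vector(a, lam):
    """Soft-threshold `a`, then rescale it to unit l2 norm.

    If every entry is thresholded to zero, the zero vector is returned
    unchanged to avoid dividing by zero.
    """
    s = soft_threshold(np.asarray(a, dtype=float), lam)
    norm = np.linalg.norm(s)
    return s / norm if norm > 0 else s


# Hypothetical weights for one block of five variables.
weights = np.array([0.9, -0.05, 0.4, 0.02, -0.6])
sparse = sparse_unit_vector(weights, lam=0.1)

# Variables with small weights (the 2nd and 4th here) are dropped.
selected = np.flatnonzero(sparse)
print(sparse)
print(selected)
```

A larger `lam` zeroes out more variables, so in this sketch the threshold plays the role of the sparsity tuning parameter.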