Chalise Prabhakar, Fridley Brooke L
Department of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester, MN 55905.
Comput Stat Data Anal. 2012 Feb 1;56(2):245-254. doi: 10.1016/j.csda.2011.07.012.
Canonical correlation analysis (CCA) is a widely used multivariate method for assessing the association between two sets of variables. However, when the number of variables far exceeds the number of subjects, as in large-scale genomic studies, the traditional CCA method is not appropriate. In addition, when the variables are highly correlated, the sample covariance matrices become unstable or undefined. To overcome these two issues, sparse canonical correlation analysis (SCCA) for multiple data sets has been proposed using a Lasso-type penalty. However, these methods do not have direct control over the sparsity of the solution. An additional step that uses the Bayesian Information Criterion (BIC) has also been suggested to further filter out unimportant features. In this paper, four penalty functions (Lasso, Elastic-net, SCAD and Hard-threshold) for SCCA, with and without the BIC filtering step, are compared using both real and simulated genotypic and mRNA expression data. This study indicates that the SCAD penalty with the BIC filter is the preferable penalty function for applying SCCA to genomic data.
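To make the four penalties concrete, the sketch below gives the standard univariate thresholding rule associated with each one in the penalized-regression literature. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the update formulas, so the function names, the tuning parameters lam, lam1 and lam2, and the SCAD constant a = 3.7 (a commonly used default) are all assumptions made here.

```python
def _sign(z):
    """Return -1.0, 0.0 or 1.0 according to the sign of z."""
    return (z > 0) - (z < 0)


def soft_threshold(z, lam):
    """Lasso rule: shrink z toward zero by lam, set small values to zero."""
    return _sign(z) * max(abs(z) - lam, 0.0)


def elastic_net_threshold(z, lam1, lam2):
    """Elastic-net rule: soft-threshold by lam1, then rescale by 1 / (1 + lam2)."""
    return soft_threshold(z, lam1) / (1.0 + lam2)


def scad_threshold(z, lam, a=3.7):
    """SCAD rule (requires a > 2): soft-threshold small values, shrink
    medium values less, and leave large values unchanged."""
    if abs(z) <= 2.0 * lam:
        return soft_threshold(z, lam)
    if abs(z) <= a * lam:
        return ((a - 1.0) * z - _sign(z) * a * lam) / (a - 2.0)
    return z


def hard_threshold(z, lam):
    """Hard-threshold rule: keep z unchanged if |z| > lam, else set it to zero."""
    return z if abs(z) > lam else 0.0


if __name__ == "__main__":
    # Hypothetical loadings, used only to show how the rules differ.
    loadings = [-4.0, -1.5, 0.3, 1.2, 2.5, 5.0]
    lam = 1.0
    for name, rule in [
        ("lasso", lambda z: soft_threshold(z, lam)),
        ("elastic-net", lambda z: elastic_net_threshold(z, lam, 0.5)),
        ("scad", lambda z: scad_threshold(z, lam)),
        ("hard", lambda z: hard_threshold(z, lam)),
    ]:
        print(name, [round(rule(z), 3) for z in loadings])
```

In a sparse CCA iteration, a rule of this kind would typically be applied elementwise to a candidate loading vector before renormalizing it. The Lasso and Elastic-net rules shrink every surviving coefficient, the hard-threshold rule shrinks none, and SCAD sits in between by leaving large coefficients unshrunk while still varying continuously in z.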