Witten Daniela M, Tibshirani Robert J
Stanford University, USA.
Stat Appl Genet Mol Biol. 2009;8(1):Article 28. doi: 10.2202/1544-6115.1470. Epub 2009 Jun 9.
Several authors have recently introduced methods for sparse canonical correlation analysis (sparse CCA). Suppose that two sets of measurements are available on the same set of observations. Sparse CCA identifies sparse linear combinations of the two sets of variables that are highly correlated with each other. It has been shown to be useful in the analysis of high-dimensional genomic data when two sets of assays are available on the same set of samples. In this paper, we propose two extensions to the sparse CCA methodology. (1) Sparse CCA is an unsupervised method; that is, it does not make use of outcome measurements that may be available for each observation (e.g., survival time or cancer subtype). We propose an extension, which we call sparse supervised CCA, that identifies linear combinations of the two sets of variables that are correlated with each other and associated with the outcome. (2) It is becoming increasingly common for researchers to collect data from more than two assays on the same set of samples; for instance, SNP, gene expression, and DNA copy number measurements may all be available. We develop sparse multiple CCA to extend the sparse CCA methodology to more than two data sets. We demonstrate these new methods on simulated data and on a recently published, publicly available diffuse large B-cell lymphoma data set.
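To make the idea of "sparse linear combinations that are highly correlated" concrete, the following is a minimal sketch of one common way to compute a single pair of sparse canonical vectors: alternating soft-thresholded updates on the cross-product matrix of the standardized data. The abstract does not specify an algorithm, so this should not be read as the authors' implementation. The function names, the `frac_u` and `frac_v` sparsity parameters (thresholds set as a fraction of the largest entry rather than through an explicit L1 bound), and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(a, delta):
    """Elementwise soft-thresholding: sign(a) * max(|a| - delta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)


def unit_norm(a):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a


def sparse_cca(X, Y, frac_u=0.5, frac_v=0.5, n_iter=50):
    """Illustrative sparse CCA for one pair of canonical vectors.

    Assumption: sparsity is controlled by thresholding at a fraction
    (frac_u, frac_v, each in [0, 1)) of the largest absolute entry,
    a simplification chosen for this sketch.
    """
    # Standardize columns so that X.T @ Y is proportional to cross-correlations.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    K = X.T @ Y

    # Start from the leading right singular vector of the cross-product matrix.
    v = np.linalg.svd(K, full_matrices=False)[2][0]
    u = np.zeros(X.shape[1])

    for _ in range(n_iter):
        a = K @ v
        u = unit_norm(soft_threshold(a, frac_u * np.abs(a).max()))
        b = K.T @ u
        v = unit_norm(soft_threshold(b, frac_v * np.abs(b).max()))
    return u, v


# Simulated example: the first 5 variables of each data set share a latent factor.
rng = np.random.default_rng(0)
n, p, q = 100, 50, 40
z = rng.normal(size=n)
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))
X[:, :5] += 2 * z[:, None]
Y[:, :5] += 2 * z[:, None]

u, v = sparse_cca(X, Y)
print("nonzero weights in u:", np.flatnonzero(u))
print("nonzero weights in v:", np.flatnonzero(v))
print("correlation of X @ u and Y @ v:", np.corrcoef(X @ u, Y @ v)[0, 1])
```

In this sketch most entries of `u` and `v` are driven exactly to zero, so the resulting linear combinations involve only a few variables from each data set. Larger values of `frac_u` and `frac_v` give sparser vectors.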