Wilms Ines, Croux Christophe
Leuven Statistics Research Centre (LStat), KU Leuven, Naamsestraat 69, Leuven, 3000, Belgium.
BMC Syst Biol. 2016 Aug 11;10(1):72. doi: 10.1186/s12918-016-0317-9.
Canonical correlation analysis (CCA) is a multivariate statistical method that describes the associations between two sets of variables. The objective is to find linear combinations of the variables in each data set that have maximal correlation. In genomics, CCA has become increasingly important for estimating the associations between gene expression data and DNA copy number change data. Identifying such associations might help to increase our understanding of the development of diseases such as cancer. However, these data sets are typically high-dimensional, containing many variables relative to the number of objects. Moreover, the data sets might contain atypical observations, since objects are likely to react differently to treatments. We discuss a method for Robust Sparse CCA that addresses both issues. Sparse estimation produces canonical vectors with some of their elements estimated as exactly zero, which improves their interpretability. Robust methods can cope with atypical observations in the data.
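To make the CCA objective concrete, the following is a minimal sketch of classical CCA only: it finds the first pair of linear combinations with maximal sample correlation. It is not the Robust Sparse CCA method discussed here, and it includes neither the sparsity nor the robustness component. The function name `first_canonical_pair`, the small `ridge` term, and the simulated data are assumptions introduced purely for illustration.

```python
import numpy as np


def first_canonical_pair(X, Y, ridge=1e-8):
    """Classical (non-robust, non-sparse) CCA: first canonical pair.

    Returns vectors a, b and the canonical correlation rho such that
    corr(X @ a, Y @ b) is maximal over all linear combinations.
    The `ridge` term is an illustrative safeguard for the Cholesky step.
    """
    # Center both data sets column-wise.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Sample covariance blocks.
    Sxx = X.T @ X / (n - 1) + ridge * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + ridge * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    # Whiten with Cholesky factors: Sxx = Lx Lx^T, Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)

    # M = Lx^{-1} Sxy Ly^{-T}; its singular values are the canonical correlations.
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    # Map the leading singular vectors back to the original coordinates.
    a = np.linalg.solve(Lx.T, U[:, 0])
    b = np.linalg.solve(Ly.T, Vt[0])
    return a, b, s[0]


# Simulated example (hypothetical data): one shared latent signal z links
# the first variable of X to the first variable of Y.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
X = np.column_stack([z + 0.5 * rng.normal(size=n), rng.normal(size=(n, 3))])
Y = np.column_stack([z + 0.5 * rng.normal(size=n), rng.normal(size=(n, 2))])

a, b, rho = first_canonical_pair(X, Y)
print("first canonical correlation:", rho)
print("canonical vector for X:", a)
print("canonical vector for Y:", b)
```

In this classical version every element of `a` and `b` is generally nonzero, and the sample covariances can be distorted by a few outlying rows. These are the two limitations that sparse and robust estimation, respectively, are meant to address.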
We illustrate the good performance of the Robust Sparse CCA method through several simulation studies and three biometric examples. Robust Sparse CCA considerably outperforms its main alternatives in (1) correctly detecting the main associations between the data sets, (2) accurately estimating these associations, and (3) detecting outliers.
Robust Sparse CCA delivers interpretable canonical vectors while at the same time coping with outlying observations. The proposed method is able to describe the associations between high-dimensional data sets, which are now commonplace in genomics. Furthermore, the Robust Sparse CCA method makes it possible to characterize outliers.