Poythress J C, Park Cheolwoo, Ahn Jeongyoun
Department of Mathematics and Statistics, University of New Hampshire, Durham, NH, USA.
Department of Mathematical Sciences, KAIST, Daejeon, The Republic of Korea.
J Appl Stat. 2021 Aug 19;49(15):3889-3907. doi: 10.1080/02664763.2021.1967892. eCollection 2022.
Many research proposals involve collecting multiple sources of information from a set of common samples, with the goal of performing an integrative analysis describing the associations between sources. We propose a method that characterizes the dominant modes of co-variation between the variables in two datasets while simultaneously performing variable selection. Our method relies on a sparse, low-rank approximation of a matrix containing pairwise measures of association between the two sets of variables. We show that the proposed method shares a close connection with another group of methods for integrative data analysis: sparse canonical correlation analysis (CCA). Under some assumptions, the proposed method and sparse CCA aim to select the same subsets of variables. We show through simulation that the proposed method can achieve higher variable selection accuracy than two state-of-the-art sparse CCA algorithms. Empirically, we demonstrate through the analysis of DNA methylation and gene expression data that the proposed method selects variables whose canonical correlation is as high as or higher than that of the variables selected by sparse CCA methods. This is a rather surprising finding, given that the objective function of the proposed method does not actually maximize the canonical correlation.
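To make the idea of a sparse, low-rank approximation of an association matrix concrete, the following is a minimal sketch and not the authors' algorithm. It assumes Pearson correlation as the pairwise association measure and computes a single sparse rank-one term by alternating soft-thresholded updates, a generic technique in the spirit of penalized matrix decompositions. The function names, the thresholds `lam_u` and `lam_v`, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry of `a` toward zero by `lam` (the L1 proximal operator)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def cross_correlation(X, Y):
    """Matrix of Pearson correlations between columns of X and columns of Y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xs.T @ Ys / X.shape[0]


def sparse_rank1(M, lam_u, lam_v, n_iter=100):
    """Sparse rank-one approximation M ~ d * outer(u, v).

    Alternates soft-thresholded power iterations, starting from the
    leading singular vectors of M. Illustrative only.
    """
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0]
    for _ in range(n_iter):
        u_new = soft_threshold(M @ v, lam_u)
        if not u_new.any():  # threshold removed every entry; keep last iterate
            break
        u = u_new / np.linalg.norm(u_new)
        v_new = soft_threshold(M.T @ u, lam_v)
        if not v_new.any():
            break
        v = v_new / np.linalg.norm(v_new)
    d = u @ M @ v
    return u, d, v


# Simulated example (assumption): one latent factor drives the first
# three X variables and the first two Y variables; the rest are noise.
rng = np.random.default_rng(0)
n, p, q = 200, 20, 15
z = rng.standard_normal(n)
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
X[:, :3] = z[:, None] + 0.5 * rng.standard_normal((n, 3))
Y[:, :2] = z[:, None] + 0.5 * rng.standard_normal((n, 2))

M = cross_correlation(X, Y)
u, d, v = sparse_rank1(M, lam_u=0.3, lam_v=0.3)
selected_x = np.flatnonzero(u)  # indices of selected X variables
selected_y = np.flatnonzero(v)  # indices of selected Y variables
```

The nonzero entries of `u` and `v` indicate which variables in each dataset take part in the dominant mode of co-variation. In practice the thresholds would need to be tuned, for example by cross-validation, and further modes could be extracted by repeating the procedure on the residual matrix `M - d * np.outer(u, v)`.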