Wang Jiuzhou, Lock Eric F
ArXiv. 2023 Aug 30:arXiv:2308.16333v1.
Statistical approaches that successfully combine multiple datasets are more powerful, efficient, and scientifically informative than separate analyses. To address variation architectures correctly and comprehensively for high-dimensional data across multiple sample sets (i.e., cohorts), we propose multiple augmented reduced rank regression (maRRR), a flexible matrix regression and factorization method to concurrently learn both covariate-driven and auxiliary structured variation. We consider a structured nuclear norm objective that is motivated by random matrix theory, in which the regression or factorization terms may be shared or specific to any number of cohorts. Our framework subsumes several existing methods, such as reduced rank regression and unsupervised multi-matrix factorization approaches, and includes a promising novel approach to regression and factorization of a single dataset (aRRR) as a special case. Simulations demonstrate substantial gains in power from combining multiple datasets, and from parsimoniously accounting for all structured variation. We apply maRRR to gene expression data from multiple cancer types (i.e., pan-cancer) from TCGA, with somatic mutations as covariates. The method performs well with respect to prediction and imputation of held-out data, and provides new insights into mutation-driven and auxiliary variation that is shared or specific to certain cancer types.
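As a rough illustration of the single-dataset special case (aRRR), the sketch below fits a covariate-driven term and an auxiliary low-rank term under nuclear norm penalties. It is a minimal sketch under stated assumptions, not the authors' algorithm. The objective form 0.5*||Y - BX - S||_F^2 + lam_B*||B||_* + lam_S*||S||_*, the function names `svt`, `arrr_sketch`, and `objective`, and the penalty weights `lam_B` and `lam_S` are all hypothetical. It also assumes the covariate matrix X (covariates by samples) has orthonormal rows, which makes each block update an exact singular value soft-thresholding step.

```python
import numpy as np

def svt(M, lam):
    """Singular value soft-thresholding: the proximal operator of lam * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - lam, 0.0)  # shrink singular values toward zero
    return (U * s) @ Vt

def objective(Y, X, B, S, lam_B, lam_S):
    """Assumed objective: squared residual plus nuclear norm penalties on B and S."""
    resid = Y - B @ X - S
    return (0.5 * np.sum(resid ** 2)
            + lam_B * np.linalg.norm(B, "nuc")
            + lam_S * np.linalg.norm(S, "nuc"))

def arrr_sketch(Y, X, lam_B, lam_S, n_iter=100):
    """Alternate exact block updates for the assumed objective.

    Y: outcomes (features x samples); X: covariates (covariates x samples).
    Assumption: X @ X.T is the identity, so the B-update reduces to
    soft-thresholding (Y - S) @ X.T.
    """
    p, n = Y.shape
    q = X.shape[0]
    B = np.zeros((p, q))  # covariate-driven coefficients
    S = np.zeros((p, n))  # auxiliary structured variation
    for _ in range(n_iter):
        B = svt((Y - S) @ X.T, lam_B)  # regression term given S
        S = svt(Y - B @ X, lam_S)      # auxiliary term given B
    return B, S
```

Because each step minimizes the assumed objective exactly over one block, the objective value cannot increase across iterations. Extending the sketch to several cohorts would mean adding one such term per shared or cohort-specific module, which is not shown here.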