Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN 55414, United States.
Biometrics. 2024 Jan 29;80(1). doi: 10.1093/biomtc/ujad002.
Statistical approaches that successfully combine multiple datasets are more powerful, efficient, and scientifically informative than separate analyses. To address variation architectures correctly and comprehensively for high-dimensional data across multiple sample sets (ie, cohorts), we propose multiple augmented reduced rank regression (maRRR), a flexible matrix regression and factorization method to concurrently learn both covariate-driven and auxiliary structured variations. We consider a structured nuclear norm objective that is motivated by random matrix theory, in which the regression or factorization terms may be shared or specific to any number of cohorts. Our framework subsumes several existing methods, such as reduced rank regression and unsupervised multimatrix factorization approaches, and includes a promising novel approach to regression and factorization of a single dataset (aRRR) as a special case. Simulations demonstrate substantial gains in power from combining multiple datasets, and from parsimoniously accounting for all structured variations. We apply maRRR to gene expression data from multiple cancer types (ie, pan-cancer) from The Cancer Genome Atlas, with somatic mutations as covariates. The method performs well with respect to prediction and imputation of held-out data, and provides new insights into mutation-driven and auxiliary variations that are shared or specific to certain cancer types.
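The abstract names a structured nuclear norm objective without stating it, so the following is only a rough illustration of how a penalty of that kind can be minimized for a single dataset. It assumes an objective of the form 0.5 * ||X - B Y - S||_F^2 + lam_B * ||B||_* + lam_S * ||S||_*, where B Y is a covariate-driven low-rank term and S an auxiliary low-rank term. It further assumes the covariate matrix Y has orthonormal rows, so that each block update reduces exactly to singular value soft-thresholding. The function names `svt`, `objective`, and `fit_sketch`, the penalty weights, and the alternating scheme are all hypothetical and are not taken from the paper.

```python
import numpy as np

def svt(M, lam):
    """Singular value soft-thresholding: the proximal operator of lam * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def objective(X, Y, B, S, lam_B, lam_S):
    """Assumed objective: squared error plus nuclear norm penalties on B and S."""
    resid = X - B @ Y - S
    nuc = lambda A: np.linalg.svd(A, compute_uv=False).sum()
    return 0.5 * np.sum(resid ** 2) + lam_B * nuc(B) + lam_S * nuc(S)

def fit_sketch(X, Y, lam_B, lam_S, n_iter=50):
    """Alternate exact block updates; requires Y @ Y.T == I (an assumption)."""
    B = np.zeros((X.shape[0], Y.shape[0]))
    S = np.zeros_like(X)
    history = []
    for _ in range(n_iter):
        # With orthonormal rows in Y, the B-update is a thresholded SVD
        # of the partial residual projected onto the covariates.
        B = svt((X - S) @ Y.T, lam_B)
        # The S-update is a thresholded SVD of the remaining residual.
        S = svt(X - B @ Y, lam_S)
        history.append(objective(X, Y, B, S, lam_B, lam_S))
    return B, S, history

# Simulated example: 30 features, 40 samples, 3 covariates (all values illustrative).
rng = np.random.default_rng(0)
m, n, q = 30, 40, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, q)))
Y = Q.T                                   # q x n with orthonormal rows
B_true = rng.standard_normal((m, q))
S_true = rng.standard_normal((m, 2)) @ rng.standard_normal((2, n))
X = B_true @ Y + S_true + 0.1 * rng.standard_normal((m, n))

B_hat, S_hat, history = fit_sketch(X, Y, lam_B=0.5, lam_S=2.0)
```

Because each update exactly minimizes the assumed objective over one block with the other held fixed, the recorded objective values should never increase. Extending this to several cohorts with shared and cohort-specific terms, as the abstract describes, would need additional terms that this sketch does not attempt.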