Lock Eric F, Park Jun Young, Hoadley Katherine A
Division of Biostatistics, School of Public Health, University of Minnesota.
Department of Statistical Sciences, Faculty of Arts & Science, University of Toronto.
Ann Appl Stat. 2022 Mar;16(1):193-215. doi: 10.1214/21-AOAS1495. Epub 2022 Mar 28.
Several modern applications require the integration of multiple large data matrices that have shared rows and/or columns. For example, cancer studies that integrate multiple omics platforms across multiple types of cancer have extended our knowledge of molecular heterogeneity beyond what was observed in single-tumor and single-platform studies. However, these studies have been limited by available statistical methodology. We propose a flexible approach to the simultaneous factorization and decomposition of variation across such matrices, BIDIFAC+. BIDIFAC+ decomposes variation into a series of low-rank components that may be shared across any number of row sets (e.g., omics platforms) or column sets (e.g., cancer types). This builds on a growing literature on the factorization and decomposition of linked matrices, which has primarily focused on multiple matrices that are linked in one dimension (rows or columns) only. Our objective function extends nuclear norm penalization, is motivated by random matrix theory, gives a unique decomposition under relatively mild conditions, and can be shown to give the mode of a Bayesian posterior distribution. We apply BIDIFAC+ to pan-omics pan-cancer data from TCGA, identifying shared and specific modes of variability across different omics platforms and 29 different cancer types.
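The nuclear norm penalization that the objective function extends can be illustrated for a single matrix. The sketch below is not the BIDIFAC+ algorithm: it shows only the standard single-matrix building block, in which minimizing 0.5*||X - A||_F^2 + lam*||A||_* over A is solved in closed form by soft-thresholding the singular values of X. The function name `soft_threshold_svd`, the simulated data, and the threshold sqrt(m) + sqrt(n) (roughly the largest singular value of an m-by-n matrix of unit-variance Gaussian noise, a choice in the spirit of random matrix theory) are assumptions made for illustration.

```python
import numpy as np

def soft_threshold_svd(X, lam):
    """Closed-form minimiser of 0.5*||X - A||_F^2 + lam*||A||_* over A.

    Hypothetical helper for illustration; not taken from the paper.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)   # shrink each singular value, floor at 0
    A_hat = (U * s_shrunk) @ Vt           # rebuild the low-rank estimate
    return A_hat, s_shrunk

# Simulated example: rank-3 signal plus unit-variance Gaussian noise.
rng = np.random.default_rng(0)
m, n, r = 60, 40, 3
signal = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
X = signal + rng.normal(size=(m, n))

# Assumed threshold: approximate largest singular value of pure noise.
lam = np.sqrt(m) + np.sqrt(n)

A_hat, s_shrunk = soft_threshold_svd(X, lam)
est_rank = int(np.sum(s_shrunk > 0))      # number of components that survive
print("estimated rank:", est_rank)
```

Components whose singular values fall below the threshold are removed entirely, which is how a nuclear norm penalty yields a low-rank estimate without fixing the rank in advance. Handling several linked matrices with components shared across row sets and column sets, as described above, requires the full method and is not attempted here.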