Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota.
Biometrics. 2020 Mar;76(1):61-74. doi: 10.1111/biom.13141. Epub 2019 Nov 10.
Advances in molecular "omics" technologies have motivated new methodologies for the integration of multiple sources of high-content biomedical data. However, most statistical methods for integrating multiple data matrices only consider data shared vertically (one cohort on multiple platforms) or horizontally (different cohorts on a single platform). This is limiting for data that take the form of bidimensionally linked matrices (e.g., multiple cohorts measured on multiple platforms), which are increasingly common in large-scale biomedical studies. In this paper, we propose bidimensional integrative factorization (BIDIFAC) for integrative dimension reduction and signal approximation of bidimensionally linked data matrices. Our method factorizes the data into (a) globally shared, (b) row-shared, (c) column-shared, and (d) single-matrix structural components, facilitating the investigation of shared and unique patterns of variability. For estimation, we use a penalized objective function that extends nuclear norm penalization for a single matrix. As an alternative to the complicated rank selection problem, we use results from random matrix theory to choose tuning parameters. We apply our method to integrate two genomics platforms (messenger RNA and microRNA expression) across two sample cohorts (tumor samples and normal tissue samples) using the breast cancer data from The Cancer Genome Atlas. We provide R code for fitting BIDIFAC, imputing missing values, and generating simulated data.
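To make the decomposition and the penalty described above concrete, the following is a sketch in notation assumed here for illustration; it is not quoted from the paper. It supposes a p-by-q grid of blocks X_{ij} of size m_i by n_j, with m = \sum_i m_i and n = \sum_j n_j, each block scaled so that its noise has unit variance. The symbols G, R, C, I, E and the \lambda's are illustrative names for the four structural components, the noise, and the tuning parameters. The specific square-root choice of tuning parameters is one standard random-matrix-theory heuristic (the approximate largest singular value of a pure-noise matrix of the given size) and should be read as an assumption about how such a rule could be applied.

```latex
% Single-matrix building block: nuclear norm penalized approximation.
% For X = U D V^T (singular value decomposition, singular values d_k), the minimizer of
\[
  \min_{A}\; \tfrac{1}{2}\,\lVert X - A \rVert_F^2 + \lambda\,\lVert A \rVert_*
\]
% soft-thresholds the singular values:
\[
  \hat{A} = U\,\operatorname{diag}\!\bigl(\max(d_k - \lambda,\, 0)\bigr)\,V^{T}.
\]

% Assumed bidimensional model for block (i, j), i = 1..p, j = 1..q:
\[
  X_{ij} = G_{ij} + R_{ij} + C_{ij} + I_{ij} + E_{ij},
\]
% where the stacked G is low rank over the whole grid (globally shared),
% R_{i\cdot} = [R_{i1}, \dots, R_{iq}] is low rank within row i (row-shared),
% C_{\cdot j} = [C_{1j}^{T}, \dots, C_{pj}^{T}]^{T} is low rank within column j (column-shared),
% and each I_{ij} is low rank on its own (single-matrix structure).

% Penalized objective extending the single-matrix case (a sketch):
\[
  \min_{G,R,C,I}\;
  \tfrac{1}{2}\sum_{i,j}\lVert X_{ij} - G_{ij} - R_{ij} - C_{ij} - I_{ij}\rVert_F^2
  + \lambda_G \lVert G \rVert_*
  + \sum_{i}\lambda_{R_i}\lVert R_{i\cdot}\rVert_*
  + \sum_{j}\lambda_{C_j}\lVert C_{\cdot j}\rVert_*
  + \sum_{i,j}\lambda_{I_{ij}}\lVert I_{ij}\rVert_* .
\]

% Assumed random-matrix-theory choice of tuning parameters (unit noise variance):
\[
  \lambda_G = \sqrt{m} + \sqrt{n},\quad
  \lambda_{R_i} = \sqrt{m_i} + \sqrt{n},\quad
  \lambda_{C_j} = \sqrt{m} + \sqrt{n_j},\quad
  \lambda_{I_{ij}} = \sqrt{m_i} + \sqrt{n_j}.
\]
```

Under this sketch, each component can be updated in turn by applying the single-matrix soft-thresholding step to the residual left by the other three components. Because the thresholds are set by the matrix dimensions, no separate rank has to be chosen for each component.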