Lock Eric F
Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minneapolis, 55455, MN, USA.
Mach Learn. 2024 Oct;113(10):7451-7477. doi: 10.1007/s10994-024-06599-8. Epub 2024 Aug 7.
Data for several applications in diverse fields can be represented as multiple matrices that are linked across rows or columns. This is particularly common in molecular biomedical research, in which multiple molecular "omics" technologies may capture different feature sets (e.g., corresponding to rows in a matrix) and/or different sample populations (corresponding to columns). This has motivated a large body of work on integrative matrix factorization approaches that identify and decompose low-dimensional signal that is shared across multiple matrices or specific to a given matrix. We propose an empirical variational Bayesian approach to this problem that has several advantages over existing techniques, including the flexibility to accommodate shared signal over any number of row or column sets (i.e., bidimensional integration), an intuitive model-based objective function that yields appropriate shrinkage for the inferred signals, and a relatively efficient estimation algorithm with no tuning parameters. A general result establishes conditions for the uniqueness of the underlying decomposition for a broad family of methods that includes the proposed approach. For scenarios with missing data, we describe an associated iterative imputation approach that is novel for the single-matrix context and a powerful approach for "blockwise" imputation (in which an entire row or column is missing) in various linked matrix contexts. Extensive simulations show that the method performs very well under different scenarios with respect to recovering underlying low-rank signal, accurately decomposing shared and specific signals, and accurately imputing missing data. The approach is applied to gene expression and miRNA data from breast cancer tissue and normal breast tissue, for which it gives an informative decomposition of variation and outperforms alternative strategies for missing data imputation.