O'Connell Michael J, Lock Eric F
Department of Statistics, Miami University, Oxford, Ohio 45056.
Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota 55455.
Biometrics. 2019 Jun;75(2):582-592. doi: 10.1111/biom.13010. Epub 2019 Apr 2.
Several recent methods address the dimension reduction and decomposition of linked high-content data matrices. Typically, these methods consider one dimension, rows or columns, that is shared among the matrices. This shared dimension may represent common features measured for different sample sets (horizontal integration) or a common sample set with features from different platforms (vertical integration). We introduce an approach for simultaneous horizontal and vertical integration, Linked Matrix Factorization (LMF), for the general case where some matrices share rows (e.g., features) and some share columns (e.g., samples). Our motivating application is a cytotoxicity study with accompanying genomic and molecular chemical attribute data. The toxicity matrix (cell lines × chemicals) shares samples with a genotype matrix (cell lines × SNPs) and shares features with a molecular attribute matrix (chemicals × attributes). LMF gives a unified low-rank factorization of these three matrices, which allows for the decomposition of systematic variation that is shared and systematic variation that is specific to each matrix. This allows for efficient dimension reduction, exploratory visualization, and the imputation of missing data even when entire rows or columns are missing. We present theoretical results concerning the uniqueness, identifiability, and minimal parametrization of LMF, and evaluate it with extensive simulation studies.
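As a rough illustration of the linked structure described above, the following Python sketch fits a shared rank-r model to three matrices by alternating least squares: X (cell lines × chemicals) ≈ U Vᵀ, Y (cell lines × SNPs) ≈ U Vyᵀ, and Z (attributes × chemicals, i.e. the attribute matrix oriented so that its columns line up with those of X) ≈ Uz Vᵀ. The function name `lmf_als`, the variable names, the SVD-based initialization, and the plain least-squares updates are all assumptions made for illustration and are not taken from the paper. The sketch covers only the shared low-rank structure; it does not model matrix-specific variation and does not handle missing entries.

```python
import numpy as np


def lmf_als(X, Y, Z, rank, n_iter=50):
    """Hypothetical sketch of a jointly factorized low-rank model.

    X : (m1, n1) array, shares rows with Y and columns with Z
    Y : (m1, n2) array
    Z : (m2, n1) array
    Returns U (m1, rank), V (n1, rank), Vy (n2, rank), Uz (m2, rank)
    such that X ~ U V^T, Y ~ U Vy^T, Z ~ Uz V^T.
    """
    # Assumed initialization: leading left singular vectors of [X Y],
    # then a least-squares fit of V to X given U.
    U = np.linalg.svd(np.hstack([X, Y]), full_matrices=False)[0][:, :rank]
    V = np.linalg.lstsq(U, X, rcond=None)[0].T

    for _ in range(n_iter):
        # Vy appears only in Y, Uz only in Z.
        Vy = np.linalg.lstsq(U, Y, rcond=None)[0].T
        Uz = np.linalg.lstsq(V, Z.T, rcond=None)[0].T
        # V is shared by X and Z, so both are stacked row-wise.
        V = np.linalg.lstsq(
            np.vstack([U, Uz]), np.vstack([X, Z]), rcond=None
        )[0].T
        # U is shared by X and Y, so both are stacked column-wise.
        U = np.linalg.lstsq(
            np.vstack([V, Vy]), np.hstack([X, Y]).T, rcond=None
        )[0].T

    return U, V, Vy, Uz


def relative_error(A, A_hat):
    """Frobenius-norm reconstruction error relative to the norm of A."""
    return np.linalg.norm(A - A_hat) / np.linalg.norm(A)
```

On synthetic matrices generated from common factors, `relative_error(X, U @ V.T)` gives a quick check of how well the shared structure is recovered. Because the row factors U are tied across X and Y and the column factors V are tied across X and Z, a chemical with attribute data but no toxicity measurements still receives a row of V, which is the intuition behind imputing an entirely missing column.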