Shu Hai, Qu Zhe
Department of Biostatistics, School of Global Public Health, New York University.
Department of Mathematics, School of Science and Engineering, Tulane University.
Electron J Stat. 2022;16(1):2475-2517. doi: 10.1214/22-EJS2008. Epub 2022 Apr 4.
A representative model for the integrative analysis of two high-dimensional correlated datasets decomposes each data matrix into three parts: a low-rank common matrix generated by latent factors shared across the datasets, a low-rank distinctive matrix specific to that dataset, and an additive noise matrix. Existing decomposition methods claim that their common matrices capture the common pattern of the two datasets. However, this so-called common pattern reflects only the common latent factors and ignores the common pattern between the two coefficient matrices of those factors. We propose a new unsupervised learning method, common and distinctive pattern analysis (CDPA), which defines the two types of data patterns appropriately by also incorporating the common and distinctive patterns of the coefficient matrices. We develop a consistent estimation approach for high-dimensional settings, and it shows reasonably good finite-sample performance in simulations. Our simulation studies and real data analysis corroborate that the proposed CDPA can better characterize common and distinctive patterns and thereby benefit data mining.
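The three-part decomposition described in the abstract can be illustrated with a small simulation. The sketch below is not the CDPA estimator and is not taken from the paper. It only generates two data matrices of the form common + distinctive + noise, where the common parts share one set of latent factors. Every name, dimension, and distribution here (`simulate_two_datasets`, `r_common`, `r_dist`, Gaussian factors and coefficients) is an assumption chosen for illustration.

```python
import numpy as np


def simulate_two_datasets(p1=30, p2=40, n=200, r_common=2, r_dist=3,
                          noise_sd=0.1, seed=0):
    """Generate two datasets X_k = C_k + D_k + E_k (hypothetical setup).

    C_k = B_k @ f_common : low-rank common matrix; the latent factors
                           f_common are shared by both datasets, while the
                           coefficient matrix B_k is dataset-specific.
    D_k = A_k @ g_k      : low-rank distinctive matrix with its own factors.
    E_k                  : additive Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    # Latent factors shared across the two datasets (r_common x n).
    f_common = rng.standard_normal((r_common, n))

    datasets = {}
    for k, p in ((1, p1), (2, p2)):
        b_k = rng.standard_normal((p, r_common))  # coefficients of common factors
        a_k = rng.standard_normal((p, r_dist))    # coefficients of distinctive factors
        g_k = rng.standard_normal((r_dist, n))    # dataset-specific latent factors
        common = b_k @ f_common
        distinct = a_k @ g_k
        noise = noise_sd * rng.standard_normal((p, n))
        datasets[k] = {
            "X": common + distinct + noise,
            "C": common,
            "D": distinct,
            "E": noise,
            "B": b_k,
        }
    return datasets


if __name__ == "__main__":
    data = simulate_two_datasets()
    for k in (1, 2):
        print(k, data[k]["X"].shape,
              np.linalg.matrix_rank(data[k]["C"]),
              np.linalg.matrix_rank(data[k]["D"]))
```

In this sketch the coefficient matrices `B` of the two datasets are drawn independently, so the common matrices share latent factors without sharing any structure in their coefficients. That is the situation the abstract points to when it argues that common latent factors alone do not fully describe the common pattern.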