Department of Statistics, Oklahoma State University, Stillwater, Oklahoma, USA.
Department of Statistics, Texas A&M University, College Station, Texas, USA.
Biometrics. 2023 Dec;79(4):2933-2946. doi: 10.1111/biom.13893. Epub 2023 Jun 22.
The prevalence of data collected on the same set of samples from multiple sources (i.e., multi-view data) has prompted significant development of data integration methods based on low-rank matrix factorizations. These methods decompose the signal matrix from each view into the sum of shared and individual structures, which are then used for dimension reduction, exploratory analyses, and quantifying associations across views. However, existing methods are limited in modeling partially shared structures because of either overly restrictive models or restrictive identifiability conditions. To address these challenges, we propose a new formulation for signal structures that includes partially shared signals, based on grouping the views into so-called hierarchical levels, with identifiability guarantees under suitable conditions. The proposed hierarchy leads us to introduce a new penalty, the hierarchical nuclear norm (HNN), for signal estimation. In contrast to existing methods, HNN penalization avoids factorizing the signals into scores and loadings and leads to a convex optimization problem, which we solve using a dual forward-backward algorithm. We propose a simple refitting procedure to adjust for the penalization bias and develop an adapted version of bi-cross-validation for selecting tuning parameters. Extensive simulation studies and an analysis of the Genotype-Tissue Expression (GTEx) data demonstrate the advantages of our method over existing alternatives.
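To make the idea of nuclear norm penalization over groups of views more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that views are stored as matrices with samples in rows, and that a hierarchy can be described by a list of index groups whose views are concatenated column-wise before taking the nuclear norm. The function names (`nuclear_norm`, `svt`, `grouped_nuclear_penalty`), the grouping, the weights, and the simulated data are all hypothetical. The exact form of the HNN penalty and the dual forward-backward algorithm from the paper are not reproduced here.

```python
import numpy as np


def nuclear_norm(X):
    """Nuclear norm of X: the sum of its singular values."""
    return np.linalg.svd(X, compute_uv=False).sum()


def svt(X, tau):
    """Singular value soft-thresholding.

    This is the proximal operator of tau * (nuclear norm) for a single
    matrix: each singular value is shrunk by tau and floored at zero.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt


def grouped_nuclear_penalty(views, groups, weights):
    """Illustrative penalty (an assumption, not the paper's definition):
    a weighted sum of nuclear norms of column-concatenated views,
    one term per group of view indices."""
    total = 0.0
    for group, weight in zip(groups, weights):
        stacked = np.hstack([views[k] for k in group])
        total += weight * nuclear_norm(stacked)
    return total


# Hypothetical data: three views on the same 30 samples with a shared
# rank-2 signal plus small noise.
rng = np.random.default_rng(0)
n = 30
shared_scores = rng.normal(size=(n, 2))
views = [
    shared_scores @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
    for p in (8, 10, 12)
]

# Hypothetical hierarchy: all views together, a pair of views, one view alone.
groups = [(0, 1, 2), (0, 1), (2,)]
weights = [1.0, 1.0, 1.0]

penalty = grouped_nuclear_penalty(views, groups, weights)
denoised = svt(views[0], tau=2.0)
rank_after = np.linalg.matrix_rank(denoised)
```

Because the groups in this sketch overlap, the proximal operator of the summed penalty is in general not a single thresholding step like `svt`, which is one reason splitting algorithms are commonly used for penalties of this kind.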