CDPA: Common and Distinctive Pattern Analysis between High-dimensional Datasets.

Author information

Shu Hai, Qu Zhe

Affiliations

Department of Biostatistics, School of Global Public Health, New York University.

Department of Mathematics, School of Science and Engineering, Tulane University.

Publication information

Electron J Stat. 2022;16(1):2475-2517. doi: 10.1214/22-EJS2008. Epub 2022 Apr 4.

DOI: 10.1214/22-EJS2008

PMID: 36034577

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9410619/

Abstract

A representative model in integrative analysis of two high-dimensional correlated datasets is to decompose each data matrix into a low-rank common matrix generated by latent factors shared across datasets, a low-rank distinctive matrix corresponding to each dataset, and an additive noise matrix. Existing decomposition methods claim that their common matrices capture the common pattern of the two datasets. However, their so-called common pattern only denotes the common latent factors but ignores the common pattern between the two coefficient matrices of these common latent factors. We propose a new unsupervised learning method, called the common and distinctive pattern analysis (CDPA), which appropriately defines the two types of data patterns by further incorporating the common and distinctive patterns of the coefficient matrices. A consistent estimation approach is developed for high-dimensional settings, and shows reasonably good finite-sample performance in simulations. Our simulation studies and real data analysis corroborate that the proposed CDPA can provide better characterization of common and distinctive patterns and thereby benefit data mining.
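The decomposition model described in the abstract can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' CDPA estimator: the function name `simulate_two_datasets`, all parameter names, the Gaussian factors and loadings, and the chosen dimensions are hypothetical, and the sketch imposes none of the identifiability constraints a real method would need. It only illustrates how each data matrix splits into a low-rank common matrix driven by shared latent factors, a low-rank distinctive matrix, and additive noise.

```python
import numpy as np

def simulate_two_datasets(p1=50, p2=40, n=200, r_common=2,
                          r_distinct=(3, 2), noise_sd=0.5, seed=0):
    """Hypothetical helper: draw Y_k = C_k + D_k + E_k for k = 1, 2."""
    rng = np.random.default_rng(seed)
    # Latent factors shared by both datasets (r_common x n).
    f_common = rng.standard_normal((r_common, n))
    datasets = []
    for p, r_d in zip((p1, p2), r_distinct):
        # Coefficient matrix of the common factors for this dataset.
        B = rng.standard_normal((p, r_common))
        C = B @ f_common                      # low-rank common matrix
        # Dataset-specific loadings and factors (assumed Gaussian here).
        A = rng.standard_normal((p, r_d))
        g = rng.standard_normal((r_d, n))
        D = A @ g                             # low-rank distinctive matrix
        E = noise_sd * rng.standard_normal((p, n))  # additive noise matrix
        datasets.append({"Y": C + D + E, "C": C, "D": D, "E": E, "B": B})
    return datasets

d1, d2 = simulate_two_datasets()
# Both common matrices are built from the same latent factors, so stacking
# them does not increase the rank beyond r_common.
print(np.linalg.matrix_rank(np.vstack([d1["C"], d2["C"]])))
```

In this sketch the two common matrices share latent factors but have unrelated coefficient matrices `B`. That is the situation the abstract points to: sharing factors alone says nothing about whether the coefficient matrices themselves have a common pattern.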
