Generalized Probabilistic Canonical Correlation Analysis for Multi-modal Data Integration with Full or Partial Observations.

Author information

Yang Tianjian, Li Wei Vivian

Affiliation

Department of Statistics, University of California, Riverside.

Publication information

ArXiv. 2025 Apr 15:arXiv:2504.11610v1.

PMID:40321951

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12047925/

Abstract

BACKGROUND

The integration and analysis of multi-modal data are increasingly essential across various domains including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only integrate diverse modalities but also leverage their complementary information to improve clustering accuracy and insights, especially when dealing with partial observations with missing data.

RESULTS

We propose Generalized Probabilistic Canonical Correlation Analysis (GPCCA), an unsupervised method for the integration and joint dimensionality reduction of multi-modal data. GPCCA addresses key challenges in multi-modal data analysis by handling missing values within the model, enabling the integration of more than two modalities, and identifying informative features while accounting for correlations within individual modalities. The model demonstrates robustness to various missing data patterns and provides low-dimensional embeddings that facilitate downstream clustering and analysis. In a range of simulation settings, GPCCA outperforms existing methods in capturing essential patterns across modalities. Additionally, we demonstrate its applicability to multi-omics data from TCGA cancer datasets and a multi-view image dataset.
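To make the idea of a shared low-dimensional embedding under missing data concrete, the following is a minimal sketch of a generic Gaussian latent-factor model, not the GPCCA implementation or its R package API. It assumes zero-mean modalities, a known loading matrix per modality, and diagonal noise variances; GPCCA itself is described as accounting for correlations within modalities, and in practice the parameters would have to be estimated (for example with an EM-type algorithm). All function and variable names (`simulate_views`, `posterior_mean`, `psi`) are hypothetical.

```python
import numpy as np


def simulate_views(n, dims, k, noise=0.1, seed=0):
    """Draw n samples of a k-dim latent z and one noisy linear view per modality."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, k))
    loadings = [rng.standard_normal((d, k)) for d in dims]
    views = [Z @ W.T + noise * rng.standard_normal((n, W.shape[0]))
             for W in loadings]
    return Z, loadings, views


def posterior_mean(x, W, psi):
    """E[z | observed entries of x] for x = W z + eps, z ~ N(0, I),
    eps ~ N(0, diag(psi)). Missing entries of x are encoded as NaN and are
    simply left out of the conditioning set (assumption: diagonal noise)."""
    obs = ~np.isnan(x)
    W_obs, psi_obs = W[obs], psi[obs]
    k = W.shape[1]
    # Posterior precision: I + W_o^T diag(psi_o)^{-1} W_o
    precision = np.eye(k) + W_obs.T @ (W_obs / psi_obs[:, None])
    return np.linalg.solve(precision, W_obs.T @ (x[obs] / psi_obs))


# Three modalities sharing a 2-dimensional latent space.
noise = 0.1
Z, loadings, views = simulate_views(n=200, dims=[30, 20, 10], k=2, noise=noise)
X = np.hstack(views)                  # samples x all features
W = np.vstack(loadings)               # stacked loadings (assumed known here)
psi = np.full(X.shape[1], noise**2)   # per-feature noise variances

# Remove about 20% of the entries at random to mimic partial observations.
rng = np.random.default_rng(1)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.2] = np.nan

# One embedding per sample, using only that sample's observed features.
embedding = np.vstack([posterior_mean(x, W, psi) for x in X_missing])
```

Because each sample conditions only on the features it actually has, no separate imputation step is needed before computing the embedding; the resulting rows of `embedding` could then be passed to any clustering routine.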

CONCLUSION

GPCCA offers a useful framework for multi-modal data integration, effectively handling missing data and providing informative low-dimensional embeddings. Its performance across cancer genomics and multi-view image data highlights its robustness and potential for broad application. To make the method accessible to the wider research community, we have released an R package, GPCCA, which is available at https://github.com/Kaversoniano/GPCCA.
