Learning interpretable representations of single-cell multi-omics data with multi-output Gaussian processes.

Author information

Moslehi Zahra, AmeriFar Sareh, de Azevedo Kevin, Buettner Florian

Affiliations

German Cancer Consortium (DKTK), partner site Frankfurt/Mainz, a partnership between DKFZ and UCT Frankfurt-Marburg, 60590 Frankfurt am Main, Germany.

German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany.

Publication information

Nucleic Acids Res. 2025 Jul 19;53(14). doi: 10.1093/nar/gkaf630.

DOI:10.1093/nar/gkaf630

PMID:40694853

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12282953/

Abstract

Learning representations of single-cell genomics data is challenging due to the nonlinear and often multi-modal nature of the data on one hand and the need for interpretable representations on the other hand. Existing approaches tend to focus either on interpretability aspects via linear matrix factorization or on maximizing expressive power via neural network-based embeddings using black-box variational autoencoders or graph embedding approaches. We address this trade-off between expressive power and interpretability by introducing a novel approach that combines highly expressive representation learning via an embedding layer with interpretable multi-output Gaussian processes within a unified framework. In our model, we learn distinct representations for samples (cells) and features (genes) from multi-modal single-cell data. We demonstrate that even a few interpretable latent dimensions can effectively capture the underlying structure of the data. Our model yields interpretable relationships between groups of cells and their associated marker genes: leveraging a gene relevance map, we establish connections between cell clusters (e.g. specific cell types) and feature clusters (e.g. marker genes for those specific cell types) within the learned latent spaces of cells and features.
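
As a rough illustration of the idea of giving cells and genes their own latent spaces under a multi-output Gaussian process, the following minimal sketch builds a covariance over all (cell, gene) pairs as a Kronecker product of two kernels. This is a generic textbook-style construction and is not taken from the paper: the RBF kernel, the Kronecker structure, the noise level, the function names `rbf_kernel` and `multi_output_covariance`, and the random latent coordinates are all assumptions made for illustration only.

```python
import numpy as np

def rbf_kernel(z, lengthscale=1.0):
    """Squared-exponential kernel between all rows of z (assumed kernel choice)."""
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def multi_output_covariance(cell_latents, gene_latents, noise=1e-2):
    """Covariance over every (cell, gene) pair as a Kronecker product.

    Hypothetical helper: entry ((i, j), (k, l)) equals
    k_cells[i, k] * k_genes[j, l], plus diagonal observation noise.
    """
    k_cells = rbf_kernel(cell_latents)
    k_genes = rbf_kernel(gene_latents)
    n = k_cells.shape[0] * k_genes.shape[0]
    return np.kron(k_cells, k_genes) + noise * np.eye(n)

rng = np.random.default_rng(0)
cell_latents = rng.normal(size=(5, 2))  # 5 hypothetical cells, 2 latent dimensions
gene_latents = rng.normal(size=(3, 2))  # 3 hypothetical genes, 2 latent dimensions

cov = multi_output_covariance(cell_latents, gene_latents)

# Draw one prior sample and reshape it into a cells x genes matrix.
# With np.kron(k_cells, k_genes), the flat index is cell * n_genes + gene.
sample = rng.multivariate_normal(np.zeros(cov.shape[0]), cov).reshape(5, 3)
```

In such a construction, cells that lie close together in the cell latent space and genes that lie close together in the gene latent space receive correlated values, which is one way nearby cell clusters and gene clusters could be related. How the published model actually couples the two spaces should be taken from the full text.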
