Qiu Wei, Dincer Ayse B, Janizek Joseph D, Celik Safiye, Pittet Mikael J, Naxerova Kamila, Lee Su-In
Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA.
Medical Scientist Training Program, University of Washington, Seattle, WA, USA.
Nat Biomed Eng. 2025 Mar;9(3):333-355. doi: 10.1038/s41551-024-01290-8. Epub 2024 Dec 17.
Clinical and biological information in large datasets of gene expression across cancers could be tapped with unsupervised deep learning. However, difficulties associated with biological interpretability and methodological robustness have made this impractical. Here we describe an unsupervised deep-learning framework for the generation of low-dimensional latent spaces for gene-expression data from 50,211 transcriptomes across 18 human cancers. The framework, which we named DeepProfile, outperformed dimensionality-reduction methods with respect to biological interpretability. It revealed that genes that are universally important in defining latent spaces across cancer types control immune cell activation, whereas cancer-type-specific genes and pathways define molecular disease subtypes. By linking latent variables in DeepProfile to secondary characteristics of tumours, we discovered that mutation burden is closely associated with the expression of cell-cycle-related genes, and that the activity of biological pathways for DNA-mismatch repair and MHC class II antigen presentation is consistently associated with patient survival. We also found that tumour-associated macrophages are a source of survival-correlated MHC class II transcripts. Unsupervised learning can facilitate the discovery of biological insight from gene-expression data.