Qiu Wei, Dincer Ayse B, Janizek Joseph D, Celik Safiye, Pittet Mikael, Naxerova Kamila, Lee Su-In
Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA.
Medical Scientist Training Program, University of Washington, Seattle, WA.
bioRxiv. 2024 Oct 26:2024.03.17.585426. doi: 10.1101/2024.03.17.585426.
Clinically and biologically valuable information may reside untapped in large cancer gene expression data sets. Deep unsupervised learning has the potential to extract this information with unprecedented efficacy but has thus far been hampered by a lack of biological interpretability and robustness. Here, we present DeepProfile, a comprehensive framework that addresses current challenges in applying unsupervised deep learning to gene expression profiles. We use DeepProfile to learn low-dimensional latent spaces for 18 human cancers from 50,211 transcriptomes. DeepProfile outperforms existing dimensionality reduction methods with respect to biological interpretability. Using DeepProfile interpretability methods, we show that genes that are universally important in defining the latent spaces across all cancer types control immune cell activation, while cancer type-specific genes and pathways define molecular disease subtypes. By linking DeepProfile latent variables to secondary tumor characteristics, we discover that tumor mutation burden is closely associated with the expression of cell cycle-related genes. DNA mismatch repair and MHC class II antigen presentation pathway expression, on the other hand, are consistently associated with patient survival. We validate these results through Kaplan-Meier analyses and nominate tumor-associated macrophages as an important source of survival-correlated MHC class II transcripts. Our results illustrate the power of unsupervised deep learning for discovery of cancer biology from existing gene expression data.