Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations.

Author affiliations

Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA.

Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA.

Publication information

Genome Biol. 2020 May 11;21(1):109. doi: 10.1186/s13059-020-02021-3.

Abstract

BACKGROUND

Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm at a single latent dimensionality. We sought to determine the extent to which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses.

RESULTS

We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained with an intermediate number of latent dimensions. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals, such as pathway activity, are best identified in models trained at higher dimensionalities.

CONCLUSIONS

There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
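As an illustration of combining features across latent dimensionalities, the following is a minimal sketch, not the authors' pipeline. It assumes non-negative matrix factorization (NMF, implemented here with multiplicative updates in NumPy) as a stand-in compression algorithm, a synthetic expression matrix, and an arbitrary list of dimensionalities `ks`; the function names are hypothetical. The abstract itself names denoising and variational autoencoders, which would slot into the same loop.

```python
import numpy as np

def nmf_compress(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative X (samples x genes) as W @ H with k latent features.

    Uses multiplicative updates for the squared-error objective and
    returns W (samples x k), the compressed representation.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    W = rng.random((n_samples, k)) + eps
    H = rng.random((k, n_genes)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W

def combine_across_dimensionalities(X, ks):
    """Fit one model per latent dimensionality and stack all features column-wise."""
    return np.hstack([nmf_compress(X, k) for k in ks])

# Synthetic stand-in for an expression matrix: 40 samples x 100 genes (assumption).
rng = np.random.default_rng(42)
X = rng.gamma(shape=2.0, scale=1.0, size=(40, 100))

ks = [2, 4, 8]  # hypothetical choice of latent dimensionalities
Z = combine_across_dimensionalities(X, ks)
print(Z.shape)  # (40, 14): 2 + 4 + 8 features per sample
```

Unlike principal components, NMF features are not nested across values of `k`, so the stacked matrix `Z` can contain different representations at each dimensionality. Downstream analyses (for example, testing each column for association with a pathway gene set or a sample label) would then run on all columns of `Z` rather than on a single fit.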

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2625/7212571/57311789f588/13059_2020_2021_Fig1_HTML.jpg
