Factorized embeddings learns rich and biologically meaningful embedding spaces using factorized tensor decomposition.

Author information

Department of Computer Science, University of Montreal, Québec, Canada.

Institute for Research in Immunology and Cancer, University of Montreal, Québec, Canada.

Publication information

Bioinformatics. 2020 Jul 1;36(Suppl_1):i417-i426. doi: 10.1093/bioinformatics/btaa488.

DOI:10.1093/bioinformatics/btaa488

PMID:32657403

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7355243/

Abstract

MOTIVATION

The recent development of sequencing technologies has revolutionized our understanding of the inner workings of the cell as well as the way disease is treated. A single RNA sequencing (RNA-Seq) experiment, however, measures tens of thousands of parameters simultaneously. While the results are information-rich, analyzing them is challenging. Dimensionality reduction methods help with this task by compressing the data into compact vector representations from which patterns can be extracted.

RESULTS

We present the factorized embeddings (FE) model, a self-supervised deep learning algorithm that uses tensor factorization to learn gene and sample representation spaces simultaneously. We ran the model on RNA-Seq data from two large-scale cohorts and observed that the sample representation captures information on both single-gene and global gene expression patterns. Moreover, we found that the gene representation space was organized such that tissue-specific genes, highly correlated genes, and genes sharing the same GO terms were grouped together. Finally, we compared the sample vector representations learned by the FE model with those of other similar models on 49 regression tasks. Representations trained with FE rank first or second in all of the tasks, sometimes surpassing the other representations by a considerable margin.
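
To make the idea of learning gene and sample representations jointly more concrete, here is a minimal sketch in Python. It is not the FE model itself: the abstract does not describe the architecture, so the sketch swaps in plain low-rank matrix factorization, where each expression value is approximated by the dot product of a sample vector and a gene vector. The function name `train_factorized_embeddings`, the synthetic data, and all hyperparameters are assumptions made for illustration only.

```python
import numpy as np


def train_factorized_embeddings(X, dim=2, lr=0.005, epochs=2000, seed=0):
    """Jointly fit sample and gene embeddings so that S @ G.T approximates X.

    Simplified stand-in (assumption): a bilinear factorization trained by
    full-batch gradient descent, not the authors' deep learning model.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    # Small random initial embeddings: one row per sample, one row per gene.
    S = 0.1 * rng.standard_normal((n_samples, dim))
    G = 0.1 * rng.standard_normal((n_genes, dim))
    losses = []
    for _ in range(epochs):
        err = S @ G.T - X                  # reconstruction error per entry
        losses.append(0.5 * float(np.sum(err ** 2)))
        grad_S = err @ G                   # gradient of the loss w.r.t. S
        grad_G = err.T @ S                 # gradient of the loss w.r.t. G
        S -= lr * grad_S
        G -= lr * grad_G
    return S, G, losses


# Synthetic "expression matrix" (hypothetical data): 20 samples x 30 genes,
# generated from hidden 2-dimensional sample and gene factors.
rng = np.random.default_rng(42)
true_S = rng.standard_normal((20, 2))
true_G = rng.standard_normal((30, 2))
X = true_S @ true_G.T

S, G, losses = train_factorized_embeddings(X)
print("initial loss:", losses[0])
print("final loss:", losses[-1])
```

Both embedding tables are updated from the same reconstruction error, which is what makes the two spaces depend on each other. After training, each row of `S` is a compact vector for one sample and each row of `G` is a compact vector for one gene.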

AVAILABILITY AND IMPLEMENTATION

A toy example in the form of a Jupyter Notebook as well as the code and trained embeddings for this project can be found at: https://github.com/TrofimovAssya/FactorizedEmbeddings.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
