Mitra Raktim, MacLean Adam L
University of Southern California, Los Angeles, CA 90007, USA.
Bioinformatics. 2021 Oct 11;37(19):3252-3262. doi: 10.1093/bioinformatics/btab260.
Methods to model dynamic changes in gene expression at a genome-wide level are not currently sufficient for large (temporally rich or single-cell) datasets. Variational autoencoders offer a means to characterize large datasets and have been used effectively to capture features of single-cell datasets. Here, we extend these methods for use with gene expression time series data.
We present RVAgene: a recurrent variational autoencoder to model gene expression dynamics. RVAgene learns to accurately and efficiently reconstruct temporal gene profiles. Via a recurrent encoder network, it also learns a low-dimensional representation of the data that can be used for biological feature discovery, and from which we can generate new gene expression data by sampling the latent space. We test RVAgene on simulated and real biological datasets, including embryonic stem cell differentiation and kidney injury response dynamics. In all cases, RVAgene accurately reconstructed complex gene expression temporal profiles. Via cross-validation, we show that a low-error latent space representation can be learnt using only a fraction of the data. Through clustering and gene ontology term enrichment analysis on the latent space, we demonstrate the potential of RVAgene for unsupervised discovery. In particular, RVAgene identifies new programs of shared gene regulation of Lox family genes in response to kidney injury.
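To make the idea of a recurrent variational autoencoder concrete, here is a minimal sketch in plain Python of one forward pass: a recurrent encoder summarizes a temporal profile into the mean and log-variance of a latent Gaussian, a latent point is sampled, and a recurrent decoder unrolls it back into a time series. It is an illustration only and not the RVAgene implementation. The tanh recurrent cell, the layer sizes, the mean-squared-error reconstruction term, the random untrained weights, and all function names are assumptions made for this example; training by gradient descent is omitted.

```python
import math
import random


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def linear(W, b, v):
    """Affine map W @ v + b on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def rnn_step(x, h, params):
    """One tanh recurrent update: h' = tanh(W_x x + W_h h + b)."""
    W_x, W_h, b = params
    from_input = linear(W_x, b, x)
    from_state = linear(W_h, [0.0] * len(h), h)
    return [math.tanh(a + c) for a, c in zip(from_input, from_state)]


def encode(series, model):
    """Recurrent encoder: final hidden state -> latent mean and log-variance."""
    h = [0.0] * model["hidden"]
    for x_t in series:
        h = rnn_step([x_t], h, model["enc_rnn"])
    return linear(*model["enc_mu"], h), linear(*model["enc_logvar"], h)


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def decode(z, n_steps, model):
    """Recurrent decoder: z sets the initial state, then the cell is unrolled."""
    h = [math.tanh(v) for v in linear(*model["dec_init"], z)]
    out = []
    for _ in range(n_steps):
        h = rnn_step([0.0], h, model["dec_rnn"])  # no external input in this sketch
        out.append(linear(*model["dec_out"], h)[0])
    return out


def vae_loss(series, recon, mu, log_var):
    """Reconstruction error (MSE) plus KL divergence to a standard normal prior."""
    mse = sum((a - b) ** 2 for a, b in zip(series, recon)) / len(series)
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    return mse + kl


def build_model(hidden, latent, rng):
    """Untrained model with random weights and zero biases (hypothetical sizes)."""
    zeros = lambda n: [0.0] * n
    return {
        "hidden": hidden,
        "enc_rnn": (init_matrix(hidden, 1, rng), init_matrix(hidden, hidden, rng), zeros(hidden)),
        "enc_mu": (init_matrix(latent, hidden, rng), zeros(latent)),
        "enc_logvar": (init_matrix(latent, hidden, rng), zeros(latent)),
        "dec_init": (init_matrix(hidden, latent, rng), zeros(hidden)),
        "dec_rnn": (init_matrix(hidden, 1, rng), init_matrix(hidden, hidden, rng), zeros(hidden)),
        "dec_out": (init_matrix(1, hidden, rng), zeros(1)),
    }


rng = random.Random(0)
model = build_model(hidden=8, latent=2, rng=rng)
series = [math.sin(0.5 * t) for t in range(10)]  # toy stand-in for one gene's temporal profile

mu, log_var = encode(series, model)
z = reparameterize(mu, log_var, rng)
recon = decode(z, len(series), model)
loss = vae_loss(series, recon, mu, log_var)

# Generating a new profile: sample the latent prior and decode.
generated = decode([rng.gauss(0.0, 1.0) for _ in range(2)], len(series), model)
```

In a trained model of this kind, the latent means `mu` computed for each gene would be the low-dimensional representation on which clustering or enrichment analysis could be run, and `generated` corresponds to sampling the latent space to produce new profiles.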
All datasets analyzed in this manuscript are publicly available and have been published previously. RVAgene is available in Python, at GitHub: https://github.com/maclean-lab/RVAgene; Zenodo archive: http://doi.org/10.5281/zenodo.4271097.
Supplementary data are available at Bioinformatics online.