Graduate School of Data Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.
Department of Mathematical Sciences, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.
BMC Bioinformatics. 2023 Nov 14;24(1):432. doi: 10.1186/s12859-023-05552-1.
Deep generative models serve naturally as nonlinear dimension reduction tools for visualizing large-scale datasets, such as single-cell RNA sequencing datasets, to reveal latent grouping patterns or identify outliers. The variational autoencoder (VAE) is a popular deep generative method equipped with an encoder/decoder structure. The encoder is useful for mapping a new sample to the latent space, and the decoder for generating a data point from a point in the latent space. However, the VAE tends not to show grouping patterns clearly without additional annotation information. In contrast, similarity-based dimension reduction methods such as t-SNE or UMAP present clear grouping patterns even though they lack an encoder/decoder structure.
To bridge this gap, we propose a new approach that incorporates similarity information into the VAE framework. In addition, for biological applications, we extend our approach to a conditional VAE to account for covariate effects in the dimension reduction step. In a simulation study and in analyses of real single-cell RNA sequencing data, our method compares favorably with existing state-of-the-art methods, producing clear grouping structures with an inferred encoder and decoder. Our method also successfully adjusts for covariate effects, resulting in more useful dimension reduction.
Our method produces clearer grouping patterns than other regularized VAE methods by exploiting the similarity information encoded in the data through the widely used UMAP loss function.
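The abstract does not state the objective function, so the following is only a minimal NumPy sketch of how a UMAP-style similarity term might be added to a standard VAE loss. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the squared-error reconstruction term, the additive weight `lam`, the kernel parameters `a = b = 1`, the choice to apply the similarity term to the latent means, and the precomputed high-dimensional similarity matrix `p`.

```python
import numpy as np


def gaussian_kl(mu, logvar):
    """KL divergence from N(mu, diag(exp(logvar))) to N(0, I), averaged over samples."""
    return 0.5 * np.mean(np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0, axis=1))


def umap_cross_entropy(z, p, a=1.0, b=1.0, eps=1e-8):
    """UMAP-style fuzzy cross entropy between similarities p and latent points z.

    p is an (n, n) matrix of high-dimensional similarities in [0, 1]
    (hypothetical input; in practice it would come from a neighbor graph).
    """
    # Pairwise squared Euclidean distances in the latent space.
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    # Low-dimensional similarity kernel q = 1 / (1 + a * d^(2b)).
    q = 1.0 / (1.0 + a * d2 ** b)
    # Ignore self-pairs on the diagonal.
    mask = ~np.eye(len(z), dtype=bool)
    p_off = p[mask]
    q_off = np.clip(q[mask], eps, 1.0 - eps)
    # Attractive term for similar pairs, repulsive term for dissimilar pairs.
    return -np.mean(p_off * np.log(q_off) + (1.0 - p_off) * np.log(1.0 - q_off))


def regularized_vae_loss(x, x_hat, mu, logvar, p, lam=1.0):
    """Reconstruction + KL + lam * similarity term (assumed additive form)."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    return recon + gaussian_kl(mu, logvar) + lam * umap_cross_entropy(mu, p)


if __name__ == "__main__":
    # Toy example: two groups of two points each, with similarity 1 within a group.
    labels = np.array([0, 0, 1, 1])
    p = (labels[:, None] == labels[None, :]).astype(float)
    mu = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
    x = np.zeros((4, 3))
    print(regularized_vae_loss(x, x, mu, np.zeros_like(mu), p))
```

In this sketch the similarity term is small when latent points that are similar in the data lie close together and dissimilar points lie far apart, which is the kind of pressure toward clear grouping that the abstract describes. A conditional variant would additionally pass covariates to the encoder and decoder; that part is not shown.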