Department of Mathematics, Stockholm University, Stockholm, Sweden.
BMC Bioinformatics. 2022 Nov 14;23(1):477. doi: 10.1186/s12859-022-05028-8.
The t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm has emerged as one of the leading methods for visualising high-dimensional (HD) data in a wide variety of fields, especially for revealing cluster structure in HD single-cell transcriptomics data. However, t-SNE often fails to correctly represent hierarchical relationships between clusters and creates spurious patterns in the embedding. In this work we generalised t-SNE using shape-aware graph distances to mitigate some of the limitations of t-SNE. Although many methods have recently been proposed to circumvent the shortcomings of t-SNE, notably Uniform Manifold Approximation and Projection (UMAP) and Potential of Heat-diffusion for Affinity-based Transition Embedding (PHATE), we see a clear advantage of the proposed graph-based method.
The superior performance of the proposed method is first demonstrated on simulated data, where quantitative validation indices show a significant improvement over t-SNE, UMAP and PHATE when visualising imbalanced, nonlinear, continuous and hierarchically structured data. The ability of the proposed method to create faithful low-dimensional embeddings, relative to the competing methods, is then shown on two real-world data sets: single-cell transcriptomics data and the MNIST image data. In addition, the only hyper-parameter of the method can be chosen automatically in a data-driven way, and this choice was consistently optimal across all test cases in this study.
In this work we show that the proposed shape-aware stochastic neighbor embedding method creates low-dimensional visualisations that robustly and accurately reveal key structures of high-dimensional data.
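As a rough illustration of what replacing pairwise Euclidean distances with graph distances can mean, the sketch below builds a k-nearest-neighbour graph and computes shortest-path lengths along it. This is an assumption made for illustration only: the abstract does not define the shape-aware distance, so shortest-path (geodesic) distance is used here as a stand-in, and the names `knn_graph`, `graph_distances` and the parameter `k` are hypothetical rather than taken from the paper.

```python
import heapq
import math


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbour graph with Euclidean edge weights."""
    n = len(points)
    adj = [dict() for _ in range(n)]
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in nearest[:k]:
            # Store the edge in both directions so the graph is undirected.
            adj[i][j] = d
            adj[j][i] = d
    return adj


def graph_distances(adj):
    """All-pairs shortest-path lengths, running Dijkstra from every node."""
    n = len(adj)
    all_dist = []
    for source in range(n):
        dist = [math.inf] * n
        dist[source] = 0.0
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale heap entry
            for v, w in adj[u].items():
                nd = d + w
                if nd < dist[v]:
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        all_dist.append(dist)
    return all_dist


# Toy data (assumed): 21 points on a half circle, a simple nonlinear shape.
points = [(math.cos(i * math.pi / 20), math.sin(i * math.pi / 20)) for i in range(21)]
D = graph_distances(knn_graph(points, k=2))

# The two ends of the arc are close in a straight line but far along the curve.
print("Euclidean:", math.dist(points[0], points[20]))
print("Graph:    ", D[0][20])
```

A distance matrix like `D` could then be handed to any t-SNE implementation that accepts precomputed distances, for example scikit-learn's `TSNE` with `metric="precomputed"` and a non-PCA initialisation such as `init="random"`. Whether this matches the authors' construction is not something the abstract confirms.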