Kobak Dmitry, Linderman George, Steinerberger Stefan, Kluger Yuval, Berens Philipp
Institute for Ophthalmic Research, University of Tübingen, Germany.
Applied Mathematics Program, Yale University, New Haven, USA.
Mach Learn Knowl Discov Databases. 2020;11906:124-139. doi: 10.1007/978-3-030-46150-8_8. Epub 2020 Apr 30.
T-distributed stochastic neighbour embedding (t-SNE) is a widely used data visualisation technique. It differs from its predecessor SNE by the low-dimensional similarity kernel: the Gaussian kernel was replaced by the heavy-tailed Cauchy kernel, solving the 'crowding problem' of SNE. Here, we develop an efficient implementation of t-SNE for a t-distribution kernel with an arbitrary degree of freedom α, with α → ∞ corresponding to SNE and α = 1 corresponding to the standard t-SNE. Using theoretical analysis and toy examples, we show that α < 1 can further reduce the crowding problem and reveal finer cluster structure that is invisible in standard t-SNE. We further demonstrate the striking effect of heavier-tailed kernels on large real-life data sets such as MNIST, single-cell RNA-sequencing data, and the HathiTrust library. We use domain knowledge to confirm that the revealed clusters are meaningful. Overall, we argue that modifying the tail heaviness of the t-SNE kernel can yield additional insight into the cluster structure of the data.
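The role of the degree of freedom can be illustrated with a small sketch. The abstract does not state the kernel formula, so the form used here, k(d) = (1 + d²/α)^(−α), is an assumption chosen because it reproduces the limits named above: the Cauchy kernel at α = 1 and the Gaussian kernel as α → ∞. The function names are hypothetical and this is not the authors' implementation.

```python
import math


def heavy_tailed_kernel(d, alpha):
    """Low-dimensional similarity of two points at distance d.

    Assumed form (not taken from the abstract):
        k(d) = (1 + d^2 / alpha) ** (-alpha)
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if math.isinf(alpha):
        # Limit alpha -> infinity: Gaussian kernel, as in SNE.
        return math.exp(-d * d)
    return (1.0 + d * d / alpha) ** (-alpha)


def cauchy_kernel(d):
    """Kernel of standard t-SNE."""
    return 1.0 / (1.0 + d * d)


def gaussian_kernel(d):
    """Kernel of the original SNE."""
    return math.exp(-d * d)


if __name__ == "__main__":
    # Compare tail heaviness at increasing distances.
    for d in (0.5, 1.0, 3.0, 10.0):
        row = ", ".join(
            f"alpha={a}: {heavy_tailed_kernel(d, a):.3e}"
            for a in (0.5, 1.0, 100.0)
        )
        print(f"d={d}: {row}")
```

Under this assumed form, a smaller α leaves more similarity at large distances, which is the sense in which α < 1 is heavier-tailed than the standard t-SNE kernel.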