Computational Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A.
Division of Biological Sciences, University of California San Diego, La Jolla, CA 92037, U.S.A.
Neural Comput. 2022 Jul 14;34(8):1637-1651. doi: 10.1162/neco_a_01504.
The t-distributed stochastic neighbor embedding (t-SNE) method is one of the leading techniques for data visualization and clustering. This method finds a lower-dimensional embedding of data points while minimizing distortions in the distances between neighboring data points. By construction, t-SNE discards information about the large-scale structure of the data. We show that adding a global cost function to the t-SNE cost function makes it possible to cluster the data while preserving the global intercluster structure. We test the new global t-SNE (g-SNE) method on one synthetic data set and two real data sets, on flower shapes and human brain cells. We find that significant and meaningful global structure exists in both the plant and human brain data sets. In all cases, g-SNE outperforms t-SNE and UMAP in preserving the global structure. Topological analysis of the clustering result makes it possible to find an appropriate trade-off in how the data are distributed across scales. We find differences in how data are distributed across scales between the two subjects that were part of the human brain data set. Thus, by striving to produce both accurate clustering and accurate positioning between clusters, the g-SNE method can identify new aspects of data organization across scales.
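The abstract describes g-SNE as the t-SNE cost plus a global cost, but does not give the form of the global term or how the two are weighted. The following is a minimal sketch of one way such a combined objective could look, not the authors' implementation. It assumes that the local term is the usual Kullback-Leibler divergence between Gaussian affinities in the data space and Student-t affinities in the embedding, that the global term is a second Kullback-Leibler divergence whose affinities grow with distance (so that far-apart pairs dominate), and that a hypothetical weight `lam` balances the two. A single fixed bandwidth `sigma` stands in for the per-point perplexity calibration used by t-SNE. All function and parameter names are illustrative.

```python
import numpy as np


def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)  # clip tiny negatives from round-off


def normalize_offdiag(W):
    """Zero the diagonal and normalize so all pairs sum to 1."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    return W / W.sum()


def kl_offdiag(P, Q, eps=1e-12):
    """KL(P || Q) summed over off-diagonal pairs."""
    mask = ~np.eye(P.shape[0], dtype=bool)
    p, q = P[mask], Q[mask]
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


def local_kl(X, Y, sigma=1.0):
    """t-SNE-style local term (assumption: fixed Gaussian bandwidth)."""
    P = normalize_offdiag(np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2)))
    Q = normalize_offdiag(1.0 / (1.0 + pairwise_sq_dists(Y)))
    return kl_offdiag(P, Q)


def global_kl(X, Y):
    """Assumed global term: affinities increase with distance."""
    P_hat = normalize_offdiag(1.0 + pairwise_sq_dists(X))
    Q_hat = normalize_offdiag(1.0 + pairwise_sq_dists(Y))
    return kl_offdiag(P_hat, Q_hat)


def gsne_cost(X, Y, lam=1.0, sigma=1.0):
    """Combined cost: local term plus lam times the assumed global term."""
    return local_kl(X, Y, sigma) + lam * global_kl(X, Y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))   # toy high-dimensional data
    Y = rng.normal(size=(50, 2))    # toy 2-D embedding
    for lam in (0.0, 0.5, 1.0):
        print(lam, gsne_cost(X, Y, lam=lam))
```

With `lam=0` the sketch reduces to the local term alone. Increasing `lam` penalizes embeddings that misplace distant points relative to one another, which is the trade-off across scales that the abstract refers to. An optimizer would minimize `gsne_cost` with respect to `Y`; that step is omitted here.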