Liu Zhexuan, Ma Rong, Zhong Yiqiao
Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA.
Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA.
Nat Commun. 2025 May 30;16(1):5037. doi: 10.1038/s41467-025-60434-9.
Visualizing high-dimensional data is essential for understanding biomedical data and deep learning models. Neighbor embedding methods, such as t-SNE and UMAP, are widely used but can introduce misleading visual artifacts. We find that the manifold learning interpretations from many prior works are inaccurate and that the misuse stems from a lack of data-independent notions of embedding maps, which project high-dimensional data into a lower-dimensional space. Leveraging the leave-one-out principle, we introduce LOO-map, a framework that extends embedding maps beyond discrete points to the entire input space. We identify two forms of map discontinuity that distort visualizations: one exaggerates cluster separation and the other creates spurious local structures. As a remedy, we develop two types of point-wise diagnostic scores to detect unreliable embedding points and improve hyperparameter selection, which are validated on datasets from computer vision and single-cell omics.
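The leave-one-out idea in the abstract can be sketched in a few lines: hold an existing embedding fixed and place a single new input point by minimizing only that point's share of a neighbor-embedding loss. The sketch below is an illustration under stated assumptions, not the authors' LOO-map implementation. The function name `loo_embed`, the fixed Gaussian bandwidth `sigma` (standing in for perplexity calibration), the t-SNE-style Student-t kernel, and the plain gradient descent are all assumptions made here for illustration.

```python
import numpy as np

def loo_embed(x_new, X, Y, sigma=1.0, steps=200, lr=0.1):
    """Place one new point into a frozen embedding Y of the data X.

    Hypothetical helper: a simplified t-SNE-like partial loss,
    not the exact objective used in the paper.
    """
    # Input-space affinities of the new point to the existing points
    # (Gaussian kernel, fixed bandwidth; shifted by the minimum for stability).
    d2 = np.sum((X - x_new) ** 2, axis=1)
    p = np.exp(-(d2 - d2.min()) / (2.0 * sigma ** 2))
    p /= p.sum()

    # Initialize at the affinity-weighted average of the frozen embedding.
    y = p @ Y
    for _ in range(steps):
        diff = y - Y                                   # shape (n, d_low)
        w = 1.0 / (1.0 + np.sum(diff ** 2, axis=1))    # Student-t kernel
        q = w / w.sum()
        # Gradient of KL(p || q) with respect to the new point only.
        grad = 2.0 * ((p - q) * w) @ diff
        y = y - lr * grad
    return y

# Stand-in data: two separated groups and a made-up 2-D "embedding" of them.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(0, 1, (50, 10)) + 10])
Y = np.vstack([rng.normal(0, 1, (50, 2)) - 5, rng.normal(0, 1, (50, 2)) + 5])
y_new = loo_embed(X[:50].mean(axis=0), X, Y)
```

Because `loo_embed` accepts any input point rather than only the training points, it can be evaluated along a path in input space (for example, points interpolated between two groups) to see whether the embedded position moves smoothly or jumps. That is the kind of map discontinuity the abstract describes.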