Xia Jiazhi, Zhang Yuchen, Song Jie, Chen Yang, Wang Yunhai, Liu Shixia
IEEE Trans Vis Comput Graph. 2022 Jan;28(1):529-539. doi: 10.1109/TVCG.2021.3114694. Epub 2021 Dec 24.
Dimensionality reduction (DR) techniques can generate 2D projections that enable visual exploration of the cluster structures in high-dimensional datasets. However, different DR techniques yield different patterns, which significantly affects performance on visual cluster analysis tasks. We present the results of a user study that investigates the influence of different DR techniques on visual cluster analysis. Our study focuses on the two property types of greatest concern, linearity and locality, and evaluates twelve representative DR techniques that cover these properties. Four controlled experiments were conducted to evaluate how well the DR techniques support the tasks of 1) cluster identification, 2) membership identification, 3) distance comparison, and 4) density comparison. We also evaluated users' subjective preferences among the DR techniques with respect to the quality of the projected clusters. The results show that: 1) non-linear and local techniques are preferred in cluster identification and membership identification; 2) linear techniques perform better than non-linear techniques in density comparison; 3) UMAP (Uniform Manifold Approximation and Projection) and t-SNE (t-Distributed Stochastic Neighbor Embedding) perform best in cluster identification and membership identification; 4) NMF (Nonnegative Matrix Factorization) performs competitively in distance comparison; 5) t-SNLE (t-Distributed Stochastic Neighbor Linear Embedding) performs competitively in density comparison.