Espadoto Mateus, Martins Rafael M, Kerren Andreas, Hirata Nina S T, Telea Alexandru C
IEEE Trans Vis Comput Graph. 2021 Mar;27(3):2153-2173. doi: 10.1109/TVCG.2019.2944182. Epub 2021 Jan 28.
Dimensionality reduction methods, also known as projections, are frequently used in multidimensional data exploration in machine learning, data science, and information visualization. Dozens of such techniques have been proposed, aiming to address a wide set of requirements, such as the ability to show high-dimensional data structure, distance or neighborhood preservation, computational scalability, stability to data noise and/or outliers, and practical ease of use. However, it is far from clear to practitioners how to choose the best technique for a given use context. We present a survey of a wide body of projection techniques that helps answer this question. For this, we characterize the input data space, the projection techniques, and the quality of projections by several quantitative metrics. We sample these three spaces according to these metrics, aiming at good coverage with bounded effort. We describe our measurements and outline observed dependencies among the measured variables. Based on these results, we draw several conclusions that help compare projection techniques, explain their results for different types of data, and ultimately help practitioners choose a projection for a given context. Our methodology, datasets, projection implementations, metrics, visualizations, and results are publicly available, so interested stakeholders can examine and/or extend this benchmark.
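To make the idea of a quantitative projection-quality metric concrete, the following is a minimal sketch of one neighborhood-preservation score: the mean fraction of each point's k nearest neighbors in the original space that remain among its k nearest neighbors after projection. The function names, the toy data, and the two example projections are assumptions made for illustration; this is not necessarily how the survey defines or implements its metrics.

```python
import math


def knn_sets(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance; break ties by index.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        neighbors.append(set(order[:k]))
    return neighbors


def neighborhood_preservation(high, low, k):
    """Mean fraction of each point's k nearest neighbors kept by the projection."""
    high_nn = knn_sets(high, k)
    low_nn = knn_sets(low, k)
    return sum(len(a & b) / k for a, b in zip(high_nn, low_nn)) / len(high)


# Hypothetical toy data: two well-separated clusters in 3-D.
high = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0),
]

# A projection that keeps the cluster structure: drop the third coordinate.
low_good = [(x, y) for x, y, _ in high]

# A deliberately poor 1-D layout that interleaves the two clusters.
low_bad = [(0.0,), (5.0,), (0.1,), (5.1,), (0.2,), (5.2,)]

score_good = neighborhood_preservation(high, low_good, k=2)
score_bad = neighborhood_preservation(high, low_bad, k=2)
print("structure-preserving projection:", score_good)
print("interleaved layout:", score_bad)
```

A score of 1 means every k-neighborhood survives the projection, and lower values indicate more neighbors lost. The brute-force neighbor search here is quadratic in the number of points, so a benchmark at realistic scale would need a faster nearest-neighbor method.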