IEEE Trans Vis Comput Graph. 2021 Feb;27(2):1720-1730. doi: 10.1109/TVCG.2020.3030432. Epub 2021 Jan 28.
Given a scatterplot with tens of thousands of points or more, a natural question is which sampling method should be used to create a small but "good" scatterplot that abstracts the data well. We present the results of a user study that investigates the influence of different sampling strategies on multi-class scatterplots. The main goal of this study is to understand how well sampling methods preserve the density, outliers, and overall shape of a scatterplot. To this end, we comprehensively review the literature and select seven typical sampling strategies as well as eight representative datasets. We then design four experiments to understand the performance of different strategies in maintaining: 1) region density; 2) class density; 3) outliers; and 4) overall shape in the sampling results. The results show that: 1) random sampling is preferred for preserving region density; 2) blue noise sampling and random sampling perform comparably to the three multi-class sampling strategies in preserving class density; 3) outlier-biased density-based sampling, recursive-subdivision-based sampling, and blue noise sampling perform the best in keeping outliers; and 4) blue noise sampling outperforms the others in maintaining the overall shape of a scatterplot.
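To make the contrast between two of the strategies named in the abstract concrete, the following is a minimal sketch of uniform random sampling and a naive dart-throwing form of blue noise sampling. It is an illustration under stated assumptions, not the implementation evaluated in the study: the function names, the synthetic data, the sample size of 200, and the radius of 0.15 are all hypothetical, and class labels are ignored.

```python
import math
import random


def random_sample(points, k, seed=0):
    """Uniform random sampling: every point is equally likely to be kept."""
    rng = random.Random(seed)
    return rng.sample(points, k)


def blue_noise_sample(points, radius, seed=0):
    """Naive dart-throwing blue noise sampling (illustrative assumption).

    Points are visited in random order; a point is kept only if no
    previously kept point lies closer than `radius`.
    """
    rng = random.Random(seed)
    order = list(points)
    rng.shuffle(order)
    kept = []
    for p in order:
        if all(math.dist(p, q) >= radius for q in kept):
            kept.append(p)
    return kept


def make_demo_points(n=2000, seed=1):
    """Hypothetical data: one dense Gaussian cluster plus sparse background."""
    rng = random.Random(seed)
    dense = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(n)]
    sparse = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(n // 20)]
    return dense + sparse


points = make_demo_points()
rs = random_sample(points, 200)              # follows the original density
bn = blue_noise_sample(points, radius=0.15)  # evenly spaced over the occupied area

print(len(points), len(rs), len(bn))
```

In this sketch, random sampling draws most of its points from the dense cluster, so relative density carries over to the sample, while the minimum-distance rule spreads the blue noise sample across every occupied region, including sparse ones. That difference is at least consistent with the reported findings on region density, outliers, and overall shape, though the study's own explanation is not given in the abstract.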