IEEE Trans Vis Comput Graph. 2018 Jan;24(1):142-151. doi: 10.1109/TVCG.2017.2745085. Epub 2017 Aug 29.
Clustering, the process of grouping together similar items into distinct partitions, is a common type of unsupervised machine learning that can be useful for summarizing and aggregating complex multi-dimensional data. However, data can be clustered in many ways, and there exists a large body of algorithms designed to reveal different patterns. While having access to a wide variety of algorithms is helpful, in practice it is quite difficult for data scientists to choose and parameterize algorithms to get clustering results relevant to their dataset and analytical tasks. To alleviate this problem, we built Clustervision, a visual analytics tool that helps data scientists find the right clustering among the large number of techniques and parameters available. Our system clusters data using a variety of clustering techniques and parameters and then ranks the clustering results using five quality metrics. In addition, users can guide the system to produce more relevant results by providing task-relevant constraints on the data. Our visual user interface allows users to find high-quality clustering results, explore the clusters using several coordinated visualization techniques, and select the clustering result that best suits their task. We demonstrate this novel approach using a case study with a team of researchers in the medical domain and show that our system empowers users to choose an effective representation of their complex data.
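The cluster-then-rank idea described above can be illustrated with a minimal sketch. It is not the Clustervision implementation: it assumes a single technique (k-means with a deterministic farthest-first initialization), a single quality metric (the mean silhouette coefficient, chosen here only as an example since the abstract does not name the five metrics), and a synthetic dataset. The function names `kmeans`, `silhouette`, and `rank_clusterings` are hypothetical.

```python
import math

def kmeans(points, k, iters=50):
    """Lloyd's k-means with farthest-first initialization (an assumption
    made for determinism). Returns one cluster label per point."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def rank_clusterings(points, ks=(2, 3, 4, 5)):
    """Cluster once per parameter setting, then sort results by quality, best first."""
    results = [(silhouette(points, labels), k, labels)
               for k in ks for labels in [kmeans(points, k)]]
    return sorted(results, key=lambda r: r[0], reverse=True)

# Synthetic data (an assumption): three well-separated 5x5 grids of points.
offsets = [(dx * 0.2, dy * 0.2) for dx in range(-2, 3) for dy in range(-2, 3)]
points = [(cx + ox, cy + oy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for ox, oy in offsets]

ranking = rank_clusterings(points)
for score, k, _ in ranking:
    print(f"k={k}  silhouette={score:.3f}")
```

On this toy data the three-cluster result ranks first. A fuller version would vary the clustering technique as well as its parameters and would combine several quality metrics rather than relying on one.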