Silhouette Analysis for Performance Evaluation in Machine Learning with Applications to Clustering.

Author information

Shutaywi Meshal, Kachouie Nezamoddin N

Affiliations

Department of Mathematics, King Abdulaziz University, Rabigh 21911, Saudi Arabia.

Department of Mathematical Sciences, Florida Institute of Technology, Melbourne, FL 32901, USA.

Publication information

Entropy (Basel). 2021 Jun 16;23(6):759. doi: 10.3390/e23060759.

DOI:10.3390/e23060759

PMID:34208552

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8234541/

Abstract

Grouping objects by similarity is a common and important task in machine learning applications. Many clustering methods have been developed. Among them, k-means based methods are widely used, and several extensions, such as k-means++ and kernel k-means, have been developed to improve on the original k-means method. K-means is a linear clustering method; that is, it divides the objects into linearly separable groups, whereas kernel k-means is a non-linear technique. Kernel k-means uses a kernel function to project the elements into a higher-dimensional feature space and then groups them there. Different kernel functions may perform differently on the same data set, so choosing the right kernel for an application can be challenging. In our previous work, we introduced a weighted majority voting method for clustering based on normalized mutual information (NMI). NMI is a supervised measure: computing it requires the true labels of a training set. In this study, we extend our previous work on aggregating clustering results and develop an unsupervised weighting function for cases where a training set is not available. The proposed weighting function is based on the Silhouette index, an unsupervised criterion, so no training set is required to compute it. This makes the new method more consistent with the unsupervised nature of clustering.
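To make the idea of a label-free weighting concrete, the following is a minimal Python sketch of the standard Silhouette index, s(i) = (b(i) - a(i)) / max(a(i), b(i)), averaged over all points, followed by one possible way to turn the scores of several candidate clusterings into voting weights. The abstract does not state the authors' weighting function, so the rescaling from [-1, 1] to [0, 1] and the normalization in `silhouette_weights` are assumptions made purely for illustration. The function names and the toy data are likewise hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_index(points, labels):
    """Mean silhouette width over all points. Uses only the data and the
    cluster assignments, so no true labels are needed."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # usual convention: s(i) = 0 for a singleton cluster
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def silhouette_weights(points, candidate_labelings):
    """Hypothetical weighting (an assumption, not the paper's formula):
    map each silhouette score from [-1, 1] to [0, 1], then normalize."""
    scores = [silhouette_index(points, labels) for labels in candidate_labelings]
    shifted = [(s + 1.0) / 2.0 for s in scores]
    norm = sum(shifted)
    if norm == 0:
        return [1.0 / len(shifted)] * len(shifted)  # fall back to equal weights
    return [w / norm for w in shifted]


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = [0, 0, 0, 1, 1, 1]  # matches the visible groups
poor = [0, 1, 0, 1, 0, 1]  # mixes the two groups

weights = silhouette_weights(points, [good, poor])
print(silhouette_index(points, good), silhouette_index(points, poor), weights)
```

In this sketch the clustering that matches the two visible groups receives the larger weight, and no ground-truth labels are consulted at any point. An NMI-based weight, by contrast, would need the true labels as a second argument.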
