Ren Min, Liu Peiyu, Wang Zhihao, Yi Jing
School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong, China; School of Mathematic and Quantitative Economics, Shandong University of Finance and Economics, Jinan, Shandong, China; Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan, Shandong, China.
School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong, China; Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan, Shandong, China.
Comput Intell Neurosci. 2016;2016:2647389. doi: 10.1155/2016/2647389. Epub 2016 Nov 29.
The fuzzy c-means algorithm (FCM) requires the number of clusters to be known in advance. To address this shortcoming, this paper proposed a new self-adaptive method to determine the optimal number of clusters. First, a density-based algorithm was put forward. Using the characteristics of the dataset, the algorithm automatically determined the maximum possible number of clusters, instead of relying on the empirical rule [Formula: see text], and obtained the optimal initial cluster centroids. This mitigated a limitation of FCM, namely that randomly selected cluster centroids can lead the algorithm to converge to a local minimum. Second, by introducing a penalty function, the paper proposed a new fuzzy clustering validity index based on fuzzy compactness and separation. The penalty ensured that, as the number of clusters approached the number of objects in the dataset, the value of the validity index did not decrease monotonically toward zero, a behavior that would otherwise cost the index its robustness and its ability to discriminate the optimal number of clusters. Then, building on these results, a self-adaptive FCM algorithm was put forward to estimate the optimal number of clusters through an iterative trial-and-error process. Finally, experiments on UCI, KDD Cup 1999, and synthetic datasets showed that the method not only determined the optimal number of clusters effectively but also reduced the number of FCM iterations while yielding stable clustering results.
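The trial-and-error search described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the density-based initialization or the penalty-based validity index, so this sketch substitutes two stand-ins, a simple farthest-point ("maximin") seeding and the classical Xie-Beni index, which is also built from fuzzy compactness and separation. The function names (`maximin_init`, `fcm`, `xie_beni`, `choose_c`) and the caller-supplied upper bound `c_max` are assumptions made for illustration.

```python
import numpy as np

def sq_dists(X, V):
    """Squared Euclidean distances between centroids V (c, d) and points X (n, d); shape (c, n)."""
    return ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)

def maximin_init(X, c):
    """Deterministic seeding (a stand-in, not the paper's density-based step):
    start at the point nearest the data mean, then repeatedly add the point
    farthest from all centroids chosen so far."""
    idx = [int(np.argmin(((X - X.mean(axis=0)) ** 2).sum(axis=1)))]
    d2 = ((X - X[idx[0]]) ** 2).sum(axis=1)
    for _ in range(c - 1):
        idx.append(int(np.argmax(d2)))
        d2 = np.minimum(d2, ((X - X[idx[-1]]) ** 2).sum(axis=1))
    return X[idx].copy()

def memberships(X, V, m):
    """FCM membership update: u_ij proportional to d_ij^(-2/(m-1)), normalized over clusters."""
    inv = np.maximum(sq_dists(X, V), 1e-12) ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def fcm(X, c, m=2.0, max_iter=300, tol=1e-6):
    """Standard fuzzy c-means; returns centroids V (c, d) and memberships U (c, n)."""
    V = maximin_init(X, c)
    for _ in range(max_iter):
        W = memberships(X, V, m) ** m
        # Centroid update: mean of the points weighted by u_ij^m.
        V_new = (W @ X) / W.sum(axis=1, keepdims=True)
        converged = np.linalg.norm(V_new - V) < tol
        V = V_new
        if converged:
            break
    return V, memberships(X, V, m)

def xie_beni(X, V, U, m=2.0):
    """Xie-Beni index (stand-in for the paper's index): fuzzy compactness
    divided by n times the smallest squared centroid separation. Lower is better."""
    compactness = ((U ** m) * sq_dists(X, V)).sum()
    sep = sq_dists(V, V)
    np.fill_diagonal(sep, np.inf)  # ignore each centroid's zero distance to itself
    return compactness / (len(X) * sep.min())

def choose_c(X, c_max, m=2.0):
    """Run FCM for c = 2..c_max and return the c with the lowest index value, plus all scores."""
    scores = {}
    for c in range(2, c_max + 1):
        V, U = fcm(X, c, m=m)
        scores[c] = xie_beni(X, V, U, m)
    return min(scores, key=scores.get), scores
```

Here `X` is an `(n, d)` NumPy array and `c_max` must be supplied by the caller, whereas the paper derives the upper bound from the data. The Xie-Beni index has no penalty term of the kind the abstract describes, so this sketch does not reproduce the paper's behavior when the number of clusters approaches the number of objects.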