Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark.
Bioinformatics. 2010 Nov 15;26(22):2841-8. doi: 10.1093/bioinformatics/btq534. Epub 2010 Sep 29.
Fuzzy c-means clustering is widely used to identify cluster structures in high-dimensional datasets, such as those obtained in DNA microarray and quantitative proteomics experiments. One of its main limitations is the lack of a computationally fast method for setting optimal values of the algorithm parameters. Poorly chosen parameter values may either introduce purely random fluctuations into the results or cause potentially important data to be ignored. The optimal parameter values are those for which the clustering yields no results on a purely random dataset yet still detects cluster formation with maximum resolution at the edge of randomness.
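For reference, the algorithm parameters in question are the fuzzifier and the number of clusters. The abstract does not spell out the formulation, so the block below gives the standard textbook form of fuzzy c-means. The notation is assumed here: N objects x_i, c centroids v_k, memberships u_ik and fuzzifier m.

```latex
% Standard fuzzy c-means objective, minimized subject to
% \sum_{k=1}^{c} u_{ik} = 1 for every object i, with fuzzifier m > 1.
J_m = \sum_{i=1}^{N} \sum_{k=1}^{c} u_{ik}^{m} \, \lVert x_i - v_k \rVert^{2}

% Alternating update rules that decrease J_m:
u_{ik} = \left[ \sum_{j=1}^{c}
         \left( \frac{\lVert x_i - v_k \rVert}{\lVert x_i - v_j \rVert} \right)^{\frac{2}{m-1}}
         \right]^{-1},
\qquad
v_k = \frac{\sum_{i=1}^{N} u_{ik}^{m} \, x_i}{\sum_{i=1}^{N} u_{ik}^{m}}
```

As m approaches 1 the memberships become hard assignments, and as m grows they all tend towards 1/c. This is why the choice of fuzzifier decides whether random fluctuations show up as clusters.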
The optimal parameter values are estimated by evaluating the results of the clustering procedure applied to randomized datasets. For such datasets, the optimal value of the fuzzifier follows common rules that depend only on the main properties of the dataset. Taking the dimension of the set and the number of objects as input values, instead of evaluating the entire dataset, allows us to propose a functional relationship that determines the fuzzifier directly. This result speaks strongly against using a predefined fuzzifier, as was done in many previous studies. Validation indices are generally used to estimate the optimal number of clusters. A comparison shows that the minimum distance between the centroids provides results that are equivalent to or better than those obtained with other, computationally more expensive indices.
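The two ingredients named above, a fuzzy c-means run with an explicit fuzzifier and the minimum distance between centroids as a validation index, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the caller-supplied initial centroids, the fixed iteration count and the use of the unsquared Euclidean distance are all choices made here. The paper's functional relationship for the fuzzifier is not reproduced, because the abstract does not state it.

```python
import math


def fuzzy_c_means(points, init_centroids, m, iters=100):
    """Standard fuzzy c-means with fuzzifier m > 1 (illustrative sketch).

    points and init_centroids are lists of equal-length coordinate lists.
    Returns (centroids, memberships); memberships are those computed in
    the final iteration, one row per point and one column per centroid.
    """
    centroids = [list(v) for v in init_centroids]
    c, dim = len(centroids), len(points[0])
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iters):
        # Membership update: u_ik is proportional to d_ik^(-2/(m-1)).
        memberships = []
        for p in points:
            # Clamp distances so a point sitting on a centroid is safe.
            dists = [max(math.dist(p, v), 1e-12) for v in centroids]
            inv = [d ** -power for d in dists]
            total = sum(inv)
            memberships.append([w / total for w in inv])
        # Centroid update: mean of the points weighted by u_ik^m.
        new_centroids = []
        for k in range(c):
            weights = [u[k] ** m for u in memberships]
            wsum = sum(weights)
            new_centroids.append([
                sum(w * p[j] for w, p in zip(weights, points)) / wsum
                for j in range(dim)
            ])
        centroids = new_centroids
    return centroids, memberships


def min_centroid_distance(centroids):
    """Smallest pairwise Euclidean distance between centroids.

    Assumption: the unsquared distance is used here; the abstract does
    not say which form the authors use.
    """
    return min(
        math.dist(a, b)
        for i, a in enumerate(centroids)
        for b in centroids[i + 1:]
    )


if __name__ == "__main__":
    # Two compact, well-separated groups of four points each (toy data).
    data = [[0, 0], [0, 1], [1, 0], [1, 1],
            [10, 10], [10, 11], [11, 10], [11, 11]]
    cents, memb = fuzzy_c_means(data, [data[0], data[-1]], m=2.0)
    print("centroids:", cents)
    print("minimum centroid distance:", min_centroid_distance(cents))
```

A value near zero means that at least two centroids have merged, which suggests that the chosen number of clusters or the fuzzifier is too large for the structure in the data. Computing it needs only the centroids, which is why it is cheap compared with indices that revisit every object.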