A simple and fast method to determine the parameters for fuzzy c-means cluster analysis.

Author information

Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark.

Publication information

Bioinformatics. 2010 Nov 15;26(22):2841-8. doi: 10.1093/bioinformatics/btq534. Epub 2010 Sep 29.

Abstract

MOTIVATION

Fuzzy c-means clustering is widely used to identify cluster structures in high-dimensional datasets, such as those obtained in DNA microarray and quantitative proteomics experiments. One of its main limitations is the lack of a computationally fast method to set optimal values of the algorithm parameters. Poorly chosen parameter values may either include purely random fluctuations in the results or cause potentially important data to be ignored. The optimal parameter values are those for which the clustering yields no result for a purely random dataset, yet detects cluster formation with maximum resolution at the edge of randomness.

RESULTS

The optimal parameter values are estimated by evaluating the results of the clustering procedure applied to randomized datasets. In this case, the optimal value of the fuzzifier follows common rules that depend only on the main properties of the dataset. Taking the dimension of the set and the number of objects as input values, instead of evaluating the entire dataset, allows us to propose a functional relationship that determines the fuzzifier directly. This result speaks strongly against using a predefined fuzzifier, as was done in many previous studies. Validation indices are generally used to estimate the optimal number of clusters. A comparison shows that the minimum distance between the centroids provides results that are equivalent to or better than those obtained with other, computationally more expensive indices.
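
To make the two parameters concrete, the following is a minimal sketch of plain fuzzy c-means together with the minimum centroid distance used as a validation index. It is not the authors' implementation. The function names (`fuzzy_c_means`, `min_centroid_distance`), the toy data, the random initialization, and the stopping rule are all assumptions made for illustration. The abstract does not state the functional relationship for the fuzzifier, so `m` is simply passed in, and the value 2.0 in the demo is a placeholder rather than a recommendation.

```python
import math
import random


def fuzzy_c_means(data, c, m, n_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means; returns (centroids, memberships).

    data: list of equal-length lists of floats; c: number of clusters;
    m: fuzzifier (> 1). Initial centroids are c randomly chosen objects
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(data, c)]
    dim = len(data[0])
    expo = 1.0 / (m - 1.0)
    u = []
    for _ in range(n_iter):
        # Membership update: u_ik is proportional to d_ik^(-2/(m-1)).
        u = []
        for x in data:
            d2 = [sum((a - b) ** 2 for a, b in zip(x, v)) for v in centroids]
            dmin = min(d2)
            if dmin == 0.0:
                # Object sits exactly on a centroid: give it full membership.
                w = [1.0 if d == 0.0 else 0.0 for d in d2]
            else:
                # Scaling by dmin keeps every weight in (0, 1].
                w = [(dmin / d) ** expo for d in d2]
            total = sum(w)
            u.append([wi / total for wi in w])
        # Centroid update: mean of the objects weighted by u_ik^m.
        new_centroids = []
        for k in range(c):
            weights = [row[k] ** m for row in u]
            wsum = sum(weights)
            if wsum == 0.0:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
                continue
            new_centroids.append([
                sum(wt * x[j] for wt, x in zip(weights, data)) / wsum
                for j in range(dim)
            ])
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, u


def min_centroid_distance(centroids):
    """Smallest Euclidean distance between any two centroids (needs c >= 2)."""
    return min(
        math.dist(a, b)
        for i, a in enumerate(centroids)
        for b in centroids[i + 1:]
    )


if __name__ == "__main__":
    # Toy data (assumption): three well-separated 2-D blobs of 20 objects each.
    rng = random.Random(1)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    data = [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)]
            for cx, cy in centers for _ in range(20)]
    # Scan candidate cluster numbers; m = 2.0 is an illustrative value only.
    for c in range(2, 6):
        cents, _ = fuzzy_c_means(data, c, m=2.0)
        print(c, round(min_centroid_distance(cents), 3))
```

Scanning `c` this way costs only one pairwise comparison of the centroids per run, which is why such an index is cheap compared with indices that revisit every object. How the index is read to pick the final number of clusters is not described in the abstract, so the scan above only reports the values.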

