Arima Chinatsu, Hakamada Kazumi, Okamoto Masahiro, Hanai Taizo
Graduate School of Systems Life Sciences, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan.
J Biosci Bioeng. 2008 Mar;105(3):273-81. doi: 10.1263/jbb.105.273.
In clustering, estimating the optimal number of clusters is important for subsequent analysis. Without detailed biological information on the genes involved, the number of clusters is difficult to evaluate, and one must rely on an internal measure based on the distribution of the data in the clustering result. The Gap statistic has been proposed as a superior method for estimating the number of clusters in crisp clustering. In this study, we propose a modified Fuzzy Gap statistic (MFGS) and apply it to fuzzy k-means clustering. To estimate the number of clusters, fuzzy k-means clustering with the MFGS was applied to two artificial data sets with noise and to two experimentally observed gene expression data sets. For the artificial data sets, the MFGS was more robust against noise than other internal measures in estimating the optimal number of clusters. It could also be used to estimate the optimal number of clusters in the experimental data sets. These results confirm that the proposed MFGS is a useful method for estimating the number of clusters in microarray data sets.
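For orientation, the sketch below illustrates the original Gap statistic for crisp k-means as formulated by Tibshirani, Walther, and Hastie, on which the MFGS builds. It is not the MFGS itself, since the abstract does not give its definition, and no fuzzy memberships are involved. The statistic compares the log within-cluster sum of squares W_k of the data with its average over reference data sets drawn uniformly from the bounding box of the data, Gap(k) = mean(log W*_k) - log W_k, and selects the smallest k with Gap(k) >= Gap(k+1) - s_(k+1). All function names (`init_centers`, `kmeans_wss`, `gap_statistic`, `choose_k`), the farthest-first seeding, and the parameter defaults are assumptions made for this illustration.

```python
import math
import random


def init_centers(points, k, rng):
    """Farthest-first seeding (an assumption of this sketch, not part of the Gap statistic)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Add the point whose distance to its nearest chosen center is largest.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans_wss(points, k, rng, n_init=3, n_iter=50):
    """Crisp k-means (Lloyd's algorithm); returns the lowest within-cluster sum of squares found."""
    best = math.inf
    for _ in range(n_init):
        centers = init_centers(points, k, rng)
        for _ in range(n_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
                clusters[nearest].append(p)
            # Recompute centroids; an empty cluster keeps its previous center.
            updated = [
                tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centers[j]
                for j, c in enumerate(clusters)
            ]
            if updated == centers:
                break
            centers = updated
        wss = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
        best = min(best, wss)
    return best


def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Return lists (Gap(k), s_k) for k = 1..k_max with a uniform bounding-box reference."""
    rng = random.Random(seed)
    bounds = [(min(axis), max(axis)) for axis in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_wss(points, k, rng))
        ref_logs = []
        for _ in range(n_ref):
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in points]
            ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_ref)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_ref))
    return gaps, errs


def choose_k(gaps, errs):
    """Smallest k with Gap(k) >= Gap(k+1) - s_(k+1); falls back to the largest k tried."""
    for k in range(1, len(gaps)):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k
    return len(gaps)


if __name__ == "__main__":
    # Synthetic demo data: two well-separated 2-D Gaussian blobs of 30 points each.
    data_rng = random.Random(1)
    data = [(data_rng.gauss(cx, 0.5), data_rng.gauss(0.0, 0.5))
            for cx in (0.0, 10.0) for _ in range(30)]
    gaps, errs = gap_statistic(data, k_max=4)
    print("estimated number of clusters:", choose_k(gaps, errs))
```

The pure-Python k-means keeps the sketch dependency-free but is slow on realistic microarray data; in practice one would likely swap in a library clustering routine and, for the fuzzy setting studied here, a membership-weighted dispersion in place of W_k.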