Novoselova Natalia, Tom Igor
Department of Bioinformatics, United Institute of Informatics Problems, Surganova Street 6, Minsk 220012, Belarus.
J Bioinform Comput Biol. 2012 Oct;10(5):1250011. doi: 10.1142/S0219720012500114. Epub 2012 Jun 26.
Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not analyze the stability of the groupings produced by a clustering algorithm. Building on an approach that assesses the predictive power or stability of a partitioning, we propose a new cluster validation measure and a selection procedure to determine a suitable number of clusters. The validity measure is based on estimating the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme, or consensus clustering. In the proposed selection procedure, a stable clustering result is determined with reference to the validity measure computed under the null hypothesis, which encodes the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to evaluate the clustering results on several datasets. The proposed procedure produced accurate and robust estimates of the number of clusters, in agreement with biological knowledge and gold standards of cluster quality.
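The workflow outlined in the abstract (resample, cluster, build a consensus matrix, score it, and compare against a permuted null data set) can be illustrated with a minimal Python sketch. The abstract does not give the formula for the "clearness" measure or the rule for comparing the validity plots, so everything specific here is an assumption made for illustration: the score `clearness` (mean of (2m - 1)^2 over off-diagonal consensus entries m, equal to 1 when every entry is 0 or 1), the use of plain k-means as the base algorithm, the independent shuffling of each feature column to form the null data, and the "largest gap" selection rule in `select_k`. All function names are hypothetical.

```python
import random

def kmeans_labels(points, k, rng, n_iter=20):
    """Plain Lloyd k-means (assumed base algorithm); one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def consensus_matrix(points, k, n_resamples=30, frac=0.8, seed=0):
    """Fraction of resamples in which two points, when both drawn, co-cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        labels = kmeans_labels([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def clearness(matrix):
    """Assumed score: 1.0 if all off-diagonal entries are 0 or 1, 0.0 if all are 0.5."""
    n = len(matrix)
    vals = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    return sum((2 * v - 1) ** 2 for v in vals) / len(vals)

def permute_features(points, rng):
    """Assumed null model: shuffle each feature column independently."""
    cols = [list(col) for col in zip(*points)]
    for col in cols:
        rng.shuffle(col)
    return [list(row) for row in zip(*cols)]

def select_k(points, k_values, seed=0):
    """Assumed rule: pick the k with the largest observed-minus-null clearness."""
    rng = random.Random(seed)
    null_points = permute_features(points, rng)
    gaps = {}
    for k in k_values:
        observed = clearness(consensus_matrix(points, k, seed=seed))
        null = clearness(consensus_matrix(null_points, k, seed=seed))
        gaps[k] = observed - null
    return max(gaps, key=gaps.get), gaps

if __name__ == "__main__":
    # Synthetic toy data (not from the paper): two blobs in two dimensions.
    rng = random.Random(42)
    data = [[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(10)]
    data += [[rng.gauss(4, 0.3), rng.gauss(0, 0.3)] for _ in range(10)]
    best_k, gaps = select_k(data, [2, 3, 4])
    print(best_k, gaps)
```

A single permuted data set is used here for brevity; averaging the null score over several permutations would give a smoother reference curve. The gap rule is only one plausible reading of "distance between the validity plots".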