National Institute for Mathematical Sciences (NIMS), Yuseong, Daejeon 305-340, Republic of Korea.
BMC Bioinformatics. 2009 Aug 22;10:260. doi: 10.1186/1471-2105-10-260.
Uncovering disease subtypes from microarray samples has important clinical implications, bearing on survival time and on the sensitivity of individual patients to specific therapies. Unsupervised clustering methods have been used to classify this type of data. However, most existing methods focus on compact clusters and do not reflect the geometric complexity of high-dimensional microarray clusters, which limits their performance.
We present MULTI-K, a cluster-number-based ensemble clustering algorithm for microarray sample classification that demonstrates high accuracy. The method aggregates multiple k-means runs performed with varying numbers of clusters and identifies the clusters whose elements show the most robust co-membership. In addition to the original algorithm, we devised an entropy plot to control the separation of singletons or small clusters. Unlike simple k-means and other widely used methods, MULTI-K accurately captured clusters with complex, high-dimensional structures. In tests with five simulated and eight real gene-expression data sets, MULTI-K outperformed other methods, including a recently developed ensemble clustering algorithm.
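The idea of aggregating k-means runs over several cluster numbers can be illustrated with a minimal sketch. This is not the authors' C++ implementation. The function names (`kmeans`, `co_membership`, `consensus_clusters`), the parameters `runs_per_k` and `threshold`, and the toy data are assumptions made for illustration. In particular, the fixed co-membership threshold is a simple stand-in for the cluster-identification step, and the entropy plot is not reproduced.

```python
import random


def nearest(p, centers):
    """Index of the center closest to p (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)  # k distinct data points as initial centers
    labels = None
    for _ in range(iters):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def co_membership(points, k_values, runs_per_k=10, seed=0):
    """Fraction of k-means runs (over all k) in which each pair shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    total = 0
    for k in k_values:
        for _ in range(runs_per_k):
            labels = kmeans(points, k, rng)
            total += 1
            for i in range(n):
                for j in range(i + 1, n):
                    if labels[i] == labels[j]:
                        counts[i][j] += 1
                        counts[j][i] += 1
    return [[c / total for c in row] for row in counts]


def consensus_clusters(co, threshold):
    """Assumed simplification: link pairs with co-membership >= threshold
    and return the connected components as cluster labels."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


if __name__ == "__main__":
    # Toy one-dimensional data: two well-separated groups (illustrative only).
    points = [(0.0,), (0.2,), (0.5,), (0.7,), (1.0,),
              (100.0,), (100.3,), (100.5,), (100.8,), (101.0,)]
    co = co_membership(points, k_values=[2, 3, 4], runs_per_k=5, seed=1)
    print(consensus_clusters(co, threshold=0.3))
```

Pairs that stay together across many values of k accumulate high co-membership, while pairs that are separated as soon as k grows do not. A consensus step built on such a matrix does not depend on a single choice of k.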
The geometric complexity of clusters should be taken into account for accurate classification of microarray data, and ensemble clustering over the number of clusters addresses this problem well. The C++ code and the tested data sets are available from the authors.