Bandyopadhyay Sanghamitra, Mukhopadhyay Anirban, Maulik Ujjwal
Machine Intelligence Unit, Indian Statistical Institute, Kolkata-700108, India.
Bioinformatics. 2007 Nov 1;23(21):2859-65. doi: 10.1093/bioinformatics/btm418. Epub 2007 Aug 25.
Recent advancements in microarray technology allow simultaneous monitoring of the expression levels of a large number of genes over different time points. Clustering is an important tool for analyzing such microarray data, which are typically characterized by inherent uncertainty, noise and imprecision. In this article, a two-stage clustering algorithm is proposed that employs a recently proposed variable string length genetic scheme and a multiobjective genetic clustering algorithm. It is based on the novel concept of points having significant membership to multiple classes. An iterated version of the well-known Fuzzy C-Means is also utilized for clustering.
The significant superiority of the proposed two-stage clustering algorithm over the average linkage method, Self Organizing Map (SOM) and a recently developed weighted Chinese restaurant-based clustering method (CRC), all widely used methods for clustering gene expression data, is established on a variety of artificial and publicly available real-life data sets. The biological relevance of the clustering solutions is also analyzed.
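As an illustration of the idea of points having significant membership to multiple classes, the following is a minimal sketch of plain Fuzzy C-Means followed by a simple threshold test on the membership matrix. It is not the authors' two-stage algorithm or their iterated Fuzzy C-Means variant, neither of which is specified in the abstract. The function names, the fuzzifier `m = 2.0`, the membership threshold `0.3`, and the toy data are all assumptions chosen for the example.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=300, tol=1e-6, seed=0):
    """Plain Fuzzy C-Means. Returns (centers, U), U has shape (n_points, n_clusters)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, each row normalized to sum to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Cluster centers are membership-weighted means of the points.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every point to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def multi_membership_points(U, threshold=0.3):
    """Indices of points whose membership is >= threshold in two or more clusters.

    The threshold rule is a hypothetical criterion for this sketch only.
    """
    return np.where((U >= threshold).sum(axis=1) >= 2)[0]

# Toy data (assumed): two tight groups plus one point midway between them.
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
    [5.5, 5.5],
])
centers, U = fuzzy_c_means(X, n_clusters=2)
ambiguous = multi_membership_points(U)
print("points with significant membership in more than one cluster:", ambiguous)
```

In this toy setting the midway point sits at similar distances from both cluster centers, so its membership is split between them, while the remaining points belong almost entirely to one cluster. A two-stage scheme could, for instance, cluster the unambiguous points first and assign the ambiguous ones afterwards, but how the paper actually does this is not stated in the abstract.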