Ma Patrick C H, Chan Keith C C
Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China.
IEEE Trans Biomed Eng. 2009 Jul;56(7):1803-9. doi: 10.1109/TBME.2009.2015055. Epub 2009 Feb 20.
Many clustering algorithms have been used to identify coexpressed genes in gene expression data. Most of them partition the data, so that each gene is allowed to belong to only one cluster. Because proteins typically interact with different groups of proteins to serve different biological roles, the genes that encode these proteins are expected to coexpress with more than one group of genes. In other words, some genes are expected to belong to more than one cluster. This poses a challenge for gene expression data clustering, because overlapping clusters must be discovered in noisy data. For this task, we propose an effective information-theoretic approach that consists of an initial clustering phase followed by a reclustering phase. The proposed approach has been tested with both simulated and real expression data. Experimental results show that it can improve the performance of existing clustering algorithms and can effectively uncover interesting patterns in noisy gene expression data, from which overlapping clusters can then be discovered.