KMeans greedy search hybrid algorithm for biclustering gene expression data.

Author information

Department of Computer Science, Cochin University of Science and Technology, Kochin, Kerala, India.

Publication information

Adv Exp Med Biol. 2010;680:181-8. doi: 10.1007/978-1-4419-5913-3_21.

PMID:20865500

Abstract

Microarray technology demands the development of algorithms capable of extracting novel and useful patterns such as biclusters. A bicluster is a submatrix of the gene expression data matrix in which the genes show highly correlated activities across all conditions in the submatrix. A measure called Mean Squared Residue (MSR) is used to evaluate the coherence of rows and columns within the submatrix. In this paper, the KMeans greedy search hybrid algorithm is developed for finding biclusters in gene expression data. The algorithm has two steps. In the first step, high-quality bicluster seeds are generated using the KMeans clustering algorithm. In the second step, these seeds are enlarged by adding more genes and conditions using a greedy strategy. The objective is to find biclusters of maximum size with an MSR value lower than a given threshold. The biclusters obtained from this algorithm on both benchmark datasets are of high quality. The statistical significance and biological relevance of the biclusters are verified using the Gene Ontology database.
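The MSR measure and the second (enlargement) step can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard Cheng and Church definition of MSR, in which the residue of an entry is its value minus its row mean, minus its column mean, plus the overall submatrix mean. The seed is hard-coded here rather than produced by KMeans, and the rule of always accepting the candidate gene or condition with the lowest resulting MSR is an illustrative choice, since the abstract does not specify the greedy criterion. The names `msr`, `enlarge`, `data`, and `threshold` are hypothetical.

```python
def msr(data, rows, cols):
    """Mean squared residue of the submatrix data[rows][cols].

    Assumes the Cheng and Church definition:
    residue(i, j) = a_ij - row_mean_i - col_mean_j + submatrix_mean.
    """
    n, m = len(rows), len(cols)
    row_mean = {i: sum(data[i][j] for j in cols) / m for i in rows}
    col_mean = {j: sum(data[i][j] for i in rows) / n for j in cols}
    total_mean = sum(row_mean.values()) / n
    return sum(
        (data[i][j] - row_mean[i] - col_mean[j] + total_mean) ** 2
        for i in rows
        for j in cols
    ) / (n * m)


def enlarge(data, rows, cols, threshold):
    """Greedily grow a seed bicluster while its MSR stays <= threshold.

    Illustrative rule (an assumption, not taken from the paper): at each
    step, try every unused gene (row) and condition (column) and keep the
    single addition that yields the lowest MSR.
    """
    rows, cols = list(rows), list(cols)
    while True:
        candidates = [(rows + [i], cols) for i in range(len(data)) if i not in rows]
        candidates += [(rows, cols + [j]) for j in range(len(data[0])) if j not in cols]
        scored = [(msr(data, r, c), r, c) for r, c in candidates]
        feasible = [s for s in scored if s[0] <= threshold]
        if not feasible:
            return rows, cols
        _, rows, cols = min(feasible, key=lambda s: s[0])


# Toy expression matrix (made-up values): genes 0-2 are perfectly coherent
# across conditions 0-2, because each row is the first row plus a constant.
data = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 1.0],
    [4.0, 5.0, 6.0, 7.0],
    [9.0, 1.0, 5.0, 2.0],
]

# Hard-coded seed standing in for a KMeans-generated seed.
rows, cols = enlarge(data, rows=[0, 1], cols=[0, 1], threshold=0.01)
print(sorted(rows), sorted(cols))  # rows [0, 1, 2], columns [0, 1, 2]
```

An MSR of zero means every row of the submatrix differs from the others by a constant shift only, which is why the coherent 3 x 3 block is recovered while the incoherent fourth gene and fourth condition are rejected.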
