Sese Jun, Kurokawa Yukinori, Monden Morito, Kato Kikuya, Morishita Shinichi
Undergraduate Program for Bioinformatics and Systems Biology, Graduate School of Frontier Sciences, University of Tokyo, Bunkyo, Tokyo, Japan.
Bioinformatics. 2004 Nov 22;20(17):3137-45. doi: 10.1093/bioinformatics/bth373. Epub 2004 Jun 24.
Gene expression profiles should be useful for distinguishing variations in disease, since they accurately reflect the status of cells. The primary clustering of gene expression reveals the genotypes responsible for the proximity of members within each cluster, while further clustering elucidates the pathological features of the individual members of each cluster. However, because the first clustering step and the second classification step, in which features are associated with clusters, are performed independently, the initial set of clusters may omit genes that are associated with pathologically meaningful features. It is therefore important to devise a way of identifying gene expression clusters that are associated with pathological features.
We present a novel technique, 'itemset constrained clustering' (IC-Clustering), which computes the optimal cluster, namely the one that maximizes the interclass variance of gene expression between groups. The groups are subject to the restriction that only divisions expressible using common features are allowed. This constraint automatically labels each cluster with a set of pathological features that characterize it. When applied to liver cancer datasets, IC-Clustering revealed informative gene expression clusters that could be annotated with various pathological features, such as 'tumor' and 'man', or 'except tumor' and 'normal liver function'. In contrast, the k-means method overlooked these clusters.
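The core idea of the constraint can be illustrated with a minimal brute-force sketch. This is not the authors' algorithm or implementation: it assumes binary pathological features, enumerates every feature set up to a small size, and scores each split with the standard between-group sum of squares as a stand-in for interclass variance. The function names, the `max_size` limit, and the toy data are all hypothetical.

```python
from itertools import combinations

import numpy as np


def interclass_variance(X, mask):
    """Between-group sum of squares for the split (mask, not mask).

    Assumed definition: n_in*|m_in - m|^2 + n_out*|m_out - m|^2,
    where m is the overall mean expression vector.
    """
    n_in = int(mask.sum())
    n_out = len(mask) - n_in
    if n_in == 0 or n_out == 0:
        return 0.0  # a split with an empty side separates nothing
    overall = X.mean(axis=0)
    d_in = X[mask].mean(axis=0) - overall
    d_out = X[~mask].mean(axis=0) - overall
    return float(n_in * (d_in @ d_in) + n_out * (d_out @ d_out))


def best_itemset_cluster(X, features, names, max_size=2):
    """Exhaustively search feature sets (itemsets) of up to max_size items.

    A sample belongs to the cluster only if it has every feature in the
    itemset, so the winning cluster is labeled by that itemset.
    """
    best = (0.0, (), None)
    for k in range(1, max_size + 1):
        for combo in combinations(range(features.shape[1]), k):
            mask = features[:, list(combo)].all(axis=1)
            score = interclass_variance(X, mask)
            if score > best[0]:
                best = (score, tuple(names[j] for j in combo), mask)
    return best


# Toy data (hypothetical): 6 samples x 2 genes, 2 binary features.
X = np.array([[5.0, 5.0], [5.0, 5.0], [0.0, 0.0],
              [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
features = np.array([[1, 1], [1, 1], [1, 0],
                     [0, 1], [0, 0], [0, 0]], dtype=bool)
names = ["tumor", "male"]

score, labels, mask = best_itemset_cluster(X, features, names)
print(labels, mask.tolist(), round(score, 3))
```

In this toy example neither 'tumor' nor 'male' alone isolates the two samples with distinct expression, but their conjunction does, so the selected cluster carries both labels. Negated labels such as 'except tumor' could be handled in this sketch by adding complemented feature columns. Exhaustive enumeration grows quickly with the number of features, so it is only suitable for illustration.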