Au Wai-Ho, Chan Keith C C, Wong Andrew K C, Wang Yang
Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
IEEE/ACM Trans Comput Biol Bioinform. 2005 Apr-Jun;2(2):83-101. doi: 10.1109/TCBB.2005.17.
This paper presents an attribute clustering method that groups genes based on their interdependence so as to mine meaningful patterns from gene expression data. It can be used for gene grouping, selection, and classification. Partitioning a relational table into attribute subgroups allows a small number of attributes within or across the groups to be selected for analysis. Clustering attributes thus reduces the search dimension of a data mining algorithm. This reduction is especially important for data mining in gene expression data because such data typically consist of a huge number of genes (attributes) and a small number of gene expression profiles (tuples). Most data mining algorithms are developed and optimized to scale with the number of tuples rather than the number of attributes. The situation becomes even worse when the number of attributes overwhelms the number of tuples, in which case the likelihood of reporting patterns that arise by chance and are actually irrelevant becomes rather high. For these reasons, gene grouping and selection are important preprocessing steps if many data mining algorithms are to be effective when applied to gene expression data. This paper defines the problem of attribute clustering and introduces a methodology for solving it. Our proposed method groups interdependent attributes into clusters by optimizing a criterion function derived from an information measure that reflects the interdependence between attributes. By applying our algorithm to gene expression data, meaningful clusters of genes are discovered. Grouping genes based on attribute interdependence within each group helps to capture different aspects of gene association patterns in each group. Significant genes selected from each group then contain useful information for gene expression classification and identification.
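As an illustration of what an information measure of interdependence between two attributes can look like, the sketch below computes mutual information normalized by joint entropy for two discretized genes. This is an assumption made for illustration only: the abstract does not state the exact measure or the criterion function, and the function names (`entropy`, `interdependence`) and the toy gene data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def interdependence(a, b):
    """Mutual information I(A;B) divided by joint entropy H(A,B).

    Assumed measure for illustration; it lies in [0, 1], with 1 meaning
    each attribute fully determines the other.
    """
    h_joint = entropy(list(zip(a, b)))
    if h_joint == 0:
        return 0.0  # both attributes are constant, nothing to share
    mutual_information = entropy(a) + entropy(b) - h_joint
    return mutual_information / h_joint


# Toy data (made up): each list is one discretized gene across six samples.
genes = {
    "g1": ["hi", "hi", "lo", "lo", "hi", "lo"],
    "g2": ["lo", "lo", "hi", "hi", "lo", "hi"],  # exact mirror of g1
    "g3": ["hi", "lo", "hi", "lo", "hi", "lo"],  # nearly unrelated to g1
}

strong = interdependence(genes["g1"], genes["g2"])
weak = interdependence(genes["g1"], genes["g3"])
```

Under this assumed measure, `g1` and `g2` score as fully interdependent while `g1` and `g3` score close to zero, so a clustering criterion built on it would tend to place `g1` and `g2` in the same group.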
To evaluate the performance of the proposed approach, we applied it to two well-known gene expression data sets and compared our results with those obtained by other methods. Our experiments show that the proposed method is able to find meaningful clusters of genes. By selecting a subset of genes that have high multiple-interdependence with others within their clusters, significant classification information can be obtained. Thus, a small pool of selected genes can be used to build classifiers with a very high classification rate. From the pool, gene expressions of different categories can be identified.
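The selection step described above can be sketched as follows, assuming that a gene's "multiple-interdependence" within a cluster is the sum of its pairwise interdependence scores with the other genes in that cluster. That definition, the helper names (`top_genes_per_cluster`, `lookup`), and all scores and cluster assignments below are hypothetical and are not taken from the paper.

```python
def top_genes_per_cluster(clusters, score, k=1):
    """Pick the k genes per cluster with the highest summed pairwise score.

    clusters: dict mapping cluster name -> list of gene names
    score:    function (gene, gene) -> float, assumed symmetric
    """
    selected = {}
    for name, members in clusters.items():
        # Sum each gene's score against every other member of its cluster.
        totals = {
            g: sum(score(g, h) for h in members if h != g) for g in members
        }
        ranked = sorted(members, key=lambda g: totals[g], reverse=True)
        selected[name] = ranked[:k]
    return selected


# Made-up pairwise interdependence scores, keyed by unordered gene pair.
pairwise = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g1", "g3")): 0.7,
    frozenset(("g2", "g3")): 0.2,
    frozenset(("g4", "g5")): 0.8,
    frozenset(("g5", "g6")): 0.6,
    frozenset(("g4", "g6")): 0.1,
}


def lookup(g, h):
    """Return the stored score for a pair, or 0.0 if the pair is absent."""
    return pairwise.get(frozenset((g, h)), 0.0)


clusters = {"c1": ["g1", "g2", "g3"], "c2": ["g4", "g5", "g6"]}
selected = top_genes_per_cluster(clusters, lookup)
```

With these toy scores, `g1` is selected from `c1` and `g5` from `c2`; the resulting small pool is what would then be passed to a classifier.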