Jörnsten Rebecka, Yu Bin
Department of Statistics, Rutgers University, 501 Hill Center, Piscataway, NJ 08854, USA.
Bioinformatics. 2003 Jun 12;19(9):1100-9. doi: 10.1093/bioinformatics/btg039.
Microarray technology allows the simultaneous monitoring of thousands of genes for each sample. The resulting high-dimensional gene expression data can be used to study similarities of gene expression profiles across different samples and so form a gene clustering. The clusters may be indicative of genetic pathways. Parallel to gene clustering is the important application of sample classification based on all or selected gene expressions. Gene clustering and sample classification are often undertaken separately, or in a directional manner (one as an aid for the other). However, separating these two tasks may obscure informative structure in the data. Here we present an algorithm for the simultaneous clustering of genes and subset selection of gene clusters for sample classification. We develop a new model selection criterion based on Rissanen's MDL (minimum description length) principle. For the first time, an MDL code length is given for both explanatory variables (genes) and response variables (sample class labels). The final output of the proposed algorithm is a sparse and interpretable classification rule based on cluster centroids or the genes closest to the centroids.
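The idea of classifying samples from gene-cluster centroids rather than from single genes can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a fixed number of clusters, uses a plain k-means step with a farthest-point start in place of the MDL-based simultaneous clustering and selection, and classifies by the nearest class mean. All function names and the toy expression values are hypothetical.

```python
# Illustrative sketch only: k-means gene clustering followed by a
# nearest-class-mean rule on cluster centroids. The MDL criterion from the
# paper is NOT implemented here; k is fixed by hand.

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def cluster_genes(expr, k, iters=10):
    """Cluster genes (rows of expr, one column per sample) into k groups."""
    # Deterministic farthest-point initialization, starting from gene 0.
    centroids = [list(expr[0])]
    while len(centroids) < k:
        far = max(expr, key=lambda g: min(sq_dist(g, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(g, centroids[c]))
                  for g in expr]
        for c in range(k):
            members = [g for g, lab in zip(expr, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels

def centroid_features(profile, labels, k):
    """Reduce one sample (expression of every gene) to k cluster averages."""
    feats = []
    for c in range(k):
        vals = [x for x, lab in zip(profile, labels) if lab == c]
        feats.append(sum(vals) / len(vals) if vals else 0.0)
    return feats

def fit_class_means(expr, classes, labels, k):
    """Average the centroid features of the training samples in each class."""
    by_class = {}
    for j, cls in enumerate(classes):
        profile = [gene[j] for gene in expr]
        by_class.setdefault(cls, []).append(
            centroid_features(profile, labels, k))
    return {cls: mean_vector(rows) for cls, rows in by_class.items()}

def classify(profile, labels, k, class_means):
    """Assign a sample to the class whose mean centroid features are closest."""
    feats = centroid_features(profile, labels, k)
    return min(class_means, key=lambda cls: sq_dist(feats, class_means[cls]))

# Toy data (made up): 6 genes x 4 training samples, classes A, A, B, B.
# Genes 0-2 separate the classes; genes 3-5 are flat.
expr = [
    [5.0, 5.2, 1.0, 1.1],
    [4.8, 5.1, 0.9, 1.2],
    [5.1, 4.9, 1.1, 0.8],
    [2.0, 2.1, 2.0, 1.9],
    [2.1, 1.9, 2.2, 2.0],
    [1.9, 2.0, 2.1, 2.1],
]
classes = ["A", "A", "B", "B"]
k = 2

gene_labels = cluster_genes(expr, k)
class_means = fit_class_means(expr, classes, gene_labels, k)
prediction = classify([4.7, 5.0, 5.3, 2.0, 2.0, 2.0], gene_labels, k, class_means)
print(gene_labels, prediction)
```

In this sketch each sample is reduced from six gene measurements to two cluster averages before classification, which is the sense in which a centroid-based rule is sparse. Choosing the number of clusters and which clusters to keep is exactly the part the paper addresses with its MDL criterion and is left out here.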
Our algorithm for simultaneous gene clustering and subset selection for classification is applied to three publicly available data sets. For all three data sets, we obtain sparse and interpretable classification models based on centroids of clusters. At the same time, these models give test error rates competitive with those of the best reported methods. Compared with classification models based on single gene selections, our rules are stable in the sense that the number of clusters has a small variability and the centroids of the clusters are well correlated (or consistent) across different cross-validation samples. We also discuss models where the centroids of clusters are replaced with the genes closest to the centroids. These models show test error rates comparable to those of models based on single gene selection, but are sparser and more stable. Moreover, we comment on how the inclusion of a classification criterion affects the gene clustering, bringing out class-informative structure in the data.
The methods presented in this paper have been implemented in the R language. The source code is available from the first author.