NEC Laboratories America, 10080 North Wolfe Road, Cupertino, CA 95014, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2010 Jan-Mar;7(1):25-36. doi: 10.1109/TCBB.2008.35.
Gene expression data usually contain a large number of genes but only a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminate biological samples of different types. In machine learning, traditional gene selection based on empirical mutual information suffers from a data sparseness problem because of the small number of samples. To overcome this sparseness problem, we propose a model-based approach that estimates the entropy of the class variables on the model rather than on the data themselves. Here, we fit the data with multivariate normal distributions, because a multivariate normal distribution has the maximum entropy among all real-valued distributions with a specified mean and standard deviation and is widely used to approximate various distributions. Given that the data follow a multivariate normal distribution, the conditional distribution of the class variables given the selected features is also normal, so its entropy can be computed from the log-determinant of its covariance matrix. Because of the large number of genes, computing all possible log-determinants is not efficient, so we propose several algorithms that greatly reduce the computational cost. Experiments on seven gene data sets and comparisons with five other approaches demonstrate the accuracy of the multivariate Gaussian generative model for feature selection and the efficiency of our algorithms.
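The following is a minimal sketch of the general idea the abstract describes: the differential entropy of a multivariate normal distribution is computed from the log-determinant of its covariance matrix, and features are chosen greedily to minimize a class-weighted entropy score. The function names, the equal-weight class prior, and the diagonal regularization term are assumptions for illustration only; this is not the paper's actual algorithm or its efficiency optimizations.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate normal with covariance `cov`:
    h = 0.5 * (d * log(2*pi*e) + log det(cov))."""
    d = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def greedy_select(X, y, k):
    """Greedy forward selection (illustrative only): at each step, add the gene
    whose inclusion yields the smallest class-weighted within-class Gaussian
    entropy on the selected subset.
    X: (n_samples, n_genes) expression matrix, y: class labels, k: genes to pick."""
    n, p = X.shape
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / n
    selected, remaining = [], list(range(p))
    for _ in range(k):
        best_j, best_score = None, np.inf
        for j in remaining:
            feats = selected + [j]
            score = 0.0
            for c, w in zip(classes, priors):
                Xc = X[y == c][:, feats]
                # Small ridge term (assumed) keeps the covariance non-singular
                # when samples are few, which is the typical regime here.
                cov = np.atleast_2d(np.cov(Xc, rowvar=False) + 1e-6 * np.eye(len(feats)))
                score += w * gaussian_entropy(cov)
            if score < best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

As written, the sketch recomputes a log-determinant for every candidate gene at every step, which is exactly the cost the paper's algorithms are designed to avoid (e.g., by incrementally updating the determinant as features are added).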