Maji Pradipta
Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700 108, India.
IEEE Trans Biomed Eng. 2009 Apr;56(4):1063-9. doi: 10.1109/TBME.2008.2004502. Epub 2008 Sep 16.
Among the large number of genes present in microarray gene expression data, only a small fraction is effective for performing a given diagnostic test. In this regard, mutual information has been shown to be successful for selecting a set of relevant and nonredundant genes from microarray data. However, information theory offers many more measures, such as the f-information measures, that may be suitable for selecting genes from microarray gene expression data. This paper presents different f-information measures as evaluation criteria for the gene selection problem. To compute gene-gene redundancy (respectively, gene-class relevance), these information measures calculate the divergence of the joint distribution of two genes' expression values (respectively, of a gene's expression values and the class labels of the samples) from the joint distribution obtained when the two genes (respectively, the gene and the class label) are assumed to be completely independent, that is, the product of their marginal distributions. The performance of the different f-information measures is compared with that of mutual information based on the predictive accuracy of the naive Bayes classifier, the K-nearest neighbor rule, and the support vector machine. An important finding is that some f-information measures are effective for selecting relevant and nonredundant genes from microarray data. The effectiveness of the different f-information measures, along with a comparison with mutual information, is demonstrated on breast cancer, leukemia, and colon cancer datasets. While some f-information measures provide 100% prediction accuracy for all three microarray datasets, mutual information attains this accuracy only for the breast cancer dataset, reaching 98.6% and 93.6% for the leukemia and colon cancer datasets, respectively.
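The divergence described in the abstract can be illustrated with a minimal sketch. It assumes expression values have already been discretized into a small number of levels, and it uses V-information (the sum of absolute differences between the joint distribution and the product of marginals) as one example of an f-information measure. The abstract does not say which measures the paper evaluates or how the data were discretized, so the function names, the choice of V-information, and the toy data below are illustrative assumptions, not the author's implementation.

```python
import numpy as np

def joint_distribution(x, y):
    """Empirical joint probability table of two discrete sequences."""
    # Map each distinct value to a row/column index (1-D inputs assumed).
    x_vals, x_idx = np.unique(np.asarray(x).ravel(), return_inverse=True)
    y_vals, y_idx = np.unique(np.asarray(y).ravel(), return_inverse=True)
    table = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(table, (x_idx, y_idx), 1.0)  # count co-occurrences
    return table / table.sum()

def independent_distribution(p):
    """Product of the marginals: the joint distribution under independence."""
    return np.outer(p.sum(axis=1), p.sum(axis=0))

def mutual_information(x, y):
    """Kullback-Leibler divergence of the joint from the product of marginals (nats)."""
    p = joint_distribution(x, y)
    q = independent_distribution(p)
    mask = p > 0  # cells with p = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def v_information(x, y):
    """V-information: sum of |joint - product of marginals| over all cells."""
    p = joint_distribution(x, y)
    q = independent_distribution(p)
    return float(np.sum(np.abs(p - q)))

# Hypothetical toy data: a discretized gene and binary class labels.
gene = [0, 0, 1, 1]
label = [0, 0, 1, 1]   # perfectly dependent on the gene
noise = [0, 1, 0, 1]   # independent of the gene

relevance_mi = mutual_information(gene, label)  # ln 2 for this toy case
relevance_v = v_information(gene, label)        # 1.0 for this toy case
redundancy_v = v_information(gene, noise)       # 0.0 for this toy case
```

Applied to a gene and the class labels, either function gives a relevance score; applied to two genes, it gives a redundancy score. Both scores are zero when the two variables are empirically independent and grow as the joint distribution departs from the product of marginals.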