Ding Chris, Peng Hanchuan
Computational Research Division, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, USA.
J Bioinform Comput Biol. 2005 Apr;3(2):185-205. doi: 10.1142/s0219720005001004.
Selecting a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expression among phenotypes and pick the top-ranked genes. We observe that feature sets obtained this way contain a certain amount of redundancy, and we study methods to minimize it. We propose a minimum redundancy-maximum relevance (MRMR) feature selection framework. Genes selected via MRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes. They lead to significantly improved class predictions in extensive experiments on 6 gene expression data sets: NCI, Lymphoma, Lung, Child Leukemia, Leukemia, and Colon. Improvements are observed consistently across 4 classification methods: naive Bayes, linear discriminant analysis, logistic regression, and support vector machines. SUPPLEMENTARY: The top 60 MRMR genes for each of the data sets are listed at http://crd.lbl.gov/~cding/MRMR/. More information related to MRMR methods can be found at http://www.hpeng.net/.
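The contrast between top-ranked selection and MRMR selection can be illustrated with a minimal sketch. The abstract does not state which relevance and redundancy measures are used, so the code below assumes discretized expression values, mutual information for both quantities, and a greedy "relevance minus mean redundancy" criterion. The function names `mutual_information` and `mrmr_select` are hypothetical and are not taken from the authors' software.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    # Build the joint probability table from co-occurrence counts.
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of y, shape (1, ny)
    nz = joint > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def mrmr_select(X, y, k):
    """Greedy selection of k column indices of X (assumed criterion:
    relevance to y minus mean redundancy with already selected columns)."""
    n_features = X.shape[1]
    relevance = np.array(
        [mutual_information(X[:, j], y) for j in range(n_features)]
    )
    # Start from the single most relevant feature.
    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(n_features)
    while len(selected) < min(k, n_features):
        last = selected[-1]
        # Accumulate each feature's redundancy with the newest selection.
        redundancy_sum += np.array(
            [mutual_information(X[:, j], X[:, last]) for j in range(n_features)]
        )
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf           # never pick a feature twice
        selected.append(int(np.argmax(score)))
    return selected
```

Under these assumptions, a feature that duplicates an already selected one is penalized by its high redundancy, so a less correlated feature of similar relevance is preferred. A plain ranking by relevance alone would not make that distinction.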