An efficient statistical feature selection approach for classification of gene expression data.

Affiliation

Department of Mathematics, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India.

Publication information

J Biomed Inform. 2011 Aug;44(4):529-35. doi: 10.1016/j.jbi.2011.01.001. Epub 2011 Jan 15.

Abstract

Classification of gene expression data plays a significant role in the prediction and diagnosis of diseases. Gene expression data have a special characteristic: a mismatch between the gene dimension and the sample dimension, with the number of genes typically far exceeding the number of samples. Not all genes contribute to efficient classification of samples, so a robust feature selection algorithm is required to identify the important genes that help classify the samples efficiently. Many feature selection algorithms have been introduced in the past to select informative genes (features) based on relevance and redundancy characteristics. Most of the earlier algorithms require a computationally expensive search strategy to find an optimal feature subset, and existing feature selection methods are also sensitive to the evaluation measures. The paper introduces a novel and efficient feature selection approach, termed ERGS (Effective Range based Gene Selection), which is based on a statistically defined effective range of each feature for every class. The basic principle behind ERGS is that a higher weight is given to a feature that discriminates the classes clearly. Experimental results on well-known gene expression datasets illustrate the effectiveness of the proposed approach. Two popular classifiers, the Naive Bayes Classifier (NBC) and the Support Vector Machine (SVM), have been used for classification. The proposed feature selection algorithm can help in ranking the genes and is capable of identifying the most relevant genes responsible for diseases such as leukemia, colon tumor, lung cancer, diffuse large B-cell lymphoma (DLBCL), and prostate cancer.
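The abstract does not give the ERGS formulas, so the following is only an illustrative sketch of the stated principle (features whose per-class ranges overlap less receive a higher weight), not the authors' algorithm. It assumes that the "effective range" of a feature within a class can be approximated by mean ± k standard deviations; the function name `effective_range_weights`, the parameter `k`, and the toy data are hypothetical.

```python
import numpy as np


def effective_range_weights(X, y, k=1.0):
    """Weight features by how little their per-class ranges overlap.

    Assumption (not taken from the abstract): the effective range of a
    feature in a class is [mean - k*std, mean + k*std]. `k` is a
    hypothetical tuning parameter.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)

    # Per-class statistics, each of shape (n_classes, n_features).
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    stds = np.array([X[y == c].std(axis=0) for c in classes])
    lower, upper = means - k * stds, means + k * stds

    # Sum the length of the shared interval over every pair of classes.
    overlap = np.zeros(X.shape[1])
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            shared = np.minimum(upper[a], upper[b]) - np.maximum(lower[a], lower[b])
            overlap += np.clip(shared, 0.0, None)

    # Express overlap relative to the total span covered by all classes.
    span = upper.max(axis=0) - lower.min(axis=0)
    ratio = np.divide(overlap, span, out=np.zeros_like(overlap), where=span > 0)

    # Normalise so the most-overlapping feature gets weight 0.
    if ratio.max() > 0:
        ratio = ratio / ratio.max()
    return 1.0 - ratio


# Toy data (made up): feature 0 separates the two classes, feature 1 does not.
X = np.array([[0.0, 5.0], [0.2, 1.0], [10.0, 5.2], [10.2, 0.8]])
y = np.array([0, 0, 1, 1])
weights = effective_range_weights(X, y)
ranking = np.argsort(-weights)  # feature indices, highest weight first
print(weights, ranking)
```

In a pipeline like the one the abstract describes, the top-ranked features from such a weighting would then be passed to a classifier such as NBC or SVM; how many features to keep is left open here.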
