Department of Information Engineering, University of Padua, Padova, Italy.
Bioinformatics. 2012 Apr 15;28(8):1151-7. doi: 10.1093/bioinformatics/bts108. Epub 2012 Mar 5.
A microarray measures the expression levels of tens of thousands of genes, producing a feature vector that is high-dimensional and contains much irrelevant information. This high dimensionality degrades classification performance. Moreover, datasets typically contain few training samples, which leads to the 'curse of dimensionality' problem. It is therefore essential to find good methods for reducing the size of the feature set.
In this article, we propose a method for gene microarray classification that combines different feature reduction approaches to improve classification performance. Using a support vector machine (SVM) as our classifier, we examine an SVM trained on a set of selected genes; an SVM trained on the feature set obtained by the Neighborhood Preserving Embedding feature transform; a set of SVMs trained on orthogonal wavelet coefficients computed with different mother wavelets; a set of SVMs trained on texture descriptors extracted from the microarray, treating it as an image; and an ensemble that combines the best of the feature extraction methods listed above. The positive results reported confirm that combining different feature extraction methods greatly enhances system performance. The experiments were performed on several different datasets, and our results [expressed as both accuracy and area under the receiver operating characteristic (ROC) curve] show that the proposed approach compares favorably with the state of the art.
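As a rough illustration of the ensemble idea, the sketch below fuses the decision scores of several classifiers and evaluates the result with the two measures named above, accuracy and area under the ROC curve. The abstract does not say how the SVMs are combined, so the fusion rule used here (min-max normalization followed by averaging, often called the sum rule) is an assumption, as are all function names. The scores are invented for illustration and are not results from the article.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale one classifier's scores to [0, 1] so classifiers are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # A constant classifier carries no ranking information.
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def sum_rule(score_sets: Sequence[Sequence[float]]) -> List[float]:
    """Fuse classifiers by averaging their normalized scores per sample.

    Assumption: the fusion rule is not specified in the abstract.
    """
    normalized = [min_max_normalize(s) for s in score_sets]
    return [sum(column) / len(normalized) for column in zip(*normalized)]


def accuracy(scores: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve, computed as the probability that a random
    positive sample scores higher than a random negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical decision scores for six samples from three SVMs, each trained
# on a different feature representation (made-up numbers, illustration only).
labels = [1, 1, 1, 0, 0, 0]
svm_selected_genes = [0.9, 0.4, 0.8, 0.5, 0.2, 0.1]
svm_npe = [0.7, 0.8, 0.3, 0.2, 0.4, 0.1]
svm_wavelet = [0.6, 0.7, 0.9, 0.3, 0.8, 0.2]

fused = sum_rule([svm_selected_genes, svm_npe, svm_wavelet])
print("fused accuracy:", accuracy(fused, labels))
print("fused AUC:", roc_auc(fused, labels))
```

Averaging normalized scores is only one of several plausible fusion rules; weighted sums or majority voting over predicted labels would fit the same interface.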
The MATLAB code of the proposed approach is publicly available at bias.csr.unibo.it/nanni/micro.rar.