Lecocke Michael, Hess Kenneth
Department of Mathematics, St. Mary's University, San Antonio, Texas 78228, USA.
Cancer Inform. 2007 Feb 23;2:313-27.
We consider both univariate- and multivariate-based feature selection for binary classification with microarray data. The aim is to determine whether the more sophisticated multivariate approach yields lower misclassification error rates, since it can consider jointly significant subsets of genes, without overfitting the data.
We present an empirical study in which 10-fold cross-validation is applied externally to one univariate-based and two multivariate-based (genetic algorithm, GA) feature selection processes. These procedures are applied with three supervised learning algorithms and six published two-class microarray datasets.
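To illustrate what applying cross-validation "externally" to feature selection means, the following is a minimal sketch in which gene selection is repeated inside every fold, using only that fold's training samples. It is not the authors' implementation: the t-statistic ranking, the nearest-centroid classifier, the unstratified folds, and all function names are assumptions chosen for brevity.

```python
import random
import statistics


def t_scores(X, y):
    """Absolute Welch-type t-statistic for each feature (column), two classes 0/1."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
        scores.append(abs(statistics.mean(a) - statistics.mean(b)) / (se + 1e-12))
    return scores


def select_top(X, y, k):
    """Univariate selection: indices of the k features with the largest score."""
    scores = t_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def centroid_predict(X_train, y_train, X_test, feats):
    """Nearest-centroid classifier restricted to the selected features (stand-in learner)."""
    cents = {}
    for c in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == c]
        cents[c] = [statistics.mean(r[j] for r in rows) for j in feats]
    preds = []
    for r in X_test:
        dist = {c: sum((r[j] - m) ** 2 for j, m in zip(feats, cents[c])) for c in (0, 1)}
        preds.append(min(dist, key=dist.get))
    return preds


def external_cv_error(X, y, k_feats=5, n_folds=10, seed=0):
    """Misclassification rate with feature selection redone inside each fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]  # simple, unstratified folds
    errors = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Key point: the held-out samples never influence which features are chosen.
        feats = select_top(X_tr, y_tr, k_feats)
        preds = centroid_predict(X_tr, y_tr, [X[i] for i in test], feats)
        errors += sum(p != y[i] for p, i in zip(preds, test))
    return errors / len(X)


# Synthetic stand-in for a microarray dataset: 40 samples, 50 "genes",
# of which only the first three differ between the two classes.
rng = random.Random(1)
y_demo = [i % 2 for i in range(40)]
X_demo = [[rng.gauss(1.5 * lab if j < 3 else 0.0, 1.0) for j in range(50)] for lab in y_demo]
print("external 10-fold CV error:", external_cv_error(X_demo, y_demo))
```

Selecting features once on the full dataset and only then cross-validating the classifier would instead give an internal estimate, which can understate the error because the held-out samples have already influenced the choice of genes.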
Across all datasets and learning algorithms, the average 10-fold external cross-validation error rates for the univariate-, single-stage GA-, and two-stage GA-based processes are 14.2%, 14.6%, and 14.2%, respectively. The optimism bias estimates from the GA analyses were half those of the univariate approach, but the selection bias estimates from the GA analyses were 2.5 times those of the univariate results.
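The abstract does not define the two bias estimates. The sketch below assumes definitions commonly used in this literature: optimism bias as the external cross-validation error minus the resubstitution (training) error, and selection bias as the external cross-validation error minus the internal cross-validation error. The function name and the input values are hypothetical and are not results from the study.

```python
def bias_estimates(resub_error, internal_cv_error, external_cv_error):
    """Return (optimism_bias, selection_bias) under the assumed definitions.

    optimism_bias  = external CV error - resubstitution error
    selection_bias = external CV error - internal CV error
    """
    optimism_bias = external_cv_error - resub_error
    selection_bias = external_cv_error - internal_cv_error
    return optimism_bias, selection_bias


# Hypothetical error rates, for illustration only.
optimism, selection = bias_estimates(
    resub_error=0.02, internal_cv_error=0.08, external_cv_error=0.14
)
print(f"optimism bias: {optimism:.2f}, selection bias: {selection:.2f}")
```

Under these assumed definitions, a method can show a small optimism bias yet a large selection bias when its internal cross-validation error sits close to its resubstitution error.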
The 10-fold external cross-validation misclassification error rates were very similar across the three feature selection processes. Further, the two-stage GA approach did not demonstrate a significant advantage over the single-stage approach. The univariate approach had higher optimism bias and lower selection bias than both GA approaches.