Han Xiaoxu
Department of Mathematics and Bioinformatics Program, Eastern Michigan University, Ypsilanti, MI 48197, USA.
Genome Inform. 2008;21:200-11.
Robust identification of cancer molecular patterns from microarray data not only plays an essential role in modern clinical oncology but also presents a challenge for statistical learning. Although principal component analysis (PCA) is a widely used feature selection algorithm in microarray analysis, its holistic mechanism prevents it from capturing the latent local data structure needed in subsequent cancer molecular pattern identification. In this study, we investigate the benefit of enforcing non-negativity constraints on PCA and propose a nonnegative principal component analysis (NPCA) based classification algorithm for cancer molecular pattern analysis of gene expression data. This novel algorithm classifies meta-samples of the input cancer data using support vector machines (SVM) or other classic supervised learning algorithms. The meta-samples are low-dimensional projections of the original cancer samples in a purely additive meta-gene subspace generated from the NPCA-induced nonnegative matrix factorization (NMF). We report clearly leading classification results for the NPCA-SVM algorithm in cancer molecular pattern identification on five benchmark gene expression datasets under 100 trials of 50% hold-out cross-validation and leave-one-out cross-validation. We demonstrate the superiority of the NPCA-SVM algorithm by direct comparison with seven classification algorithms (SVM, PCA-SVM, KPCA-SVM, NMF-SVM, LLE-SVM, PCA-LDA and k-NN) on the five cancer datasets in terms of classification rates, sensitivities and specificities. The NPCA-SVM algorithm overcomes the over-fitting problem associated with SVM-based classification of gene expression data under a Gaussian kernel. As a more robust, high-performance classifier, NPCA-SVM can replace the general SVM and k-NN classifiers in cancer biomarker discovery to capture more meaningful oncogenes.
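The evaluation protocol named in the abstract (repeated 50% hold-out trials scored by classification rate, sensitivity and specificity) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, the split is stratified by class, label 1 is treated as the positive (cancer) class, and a nearest-centroid rule stands in as a placeholder for the NPCA-SVM classifier, which is not implemented here.

```python
import random
from statistics import mean


def rates(y_true, y_pred, positive=1):
    """Return (classification rate, sensitivity, specificity)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Guard against an empty class in the test split.
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


def fit_centroids(X, y):
    """Placeholder classifier (NOT NPCA-SVM): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each sample to the class with the nearest centroid."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in X]


def holdout_trials(X, y, n_trials=100, seed=0):
    """Average the three rates over repeated stratified 50% hold-out splits."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        train_idx, test_idx = [], []
        for label in sorted(set(y)):
            idx = [i for i, t in enumerate(y) if t == label]
            rng.shuffle(idx)
            half = len(idx) // 2
            train_idx += idx[:half]   # half of each class for training
            test_idx += idx[half:]    # the remainder for testing
        model = fit_centroids([X[i] for i in train_idx],
                              [y[i] for i in train_idx])
        y_pred = predict(model, [X[i] for i in test_idx])
        results.append(rates([y[i] for i in test_idx], y_pred))
    return tuple(mean(r[k] for r in results) for k in range(3))


# Synthetic toy data (not a real expression dataset): two shifted classes.
toy_rng = random.Random(1)
X_toy = ([[toy_rng.gauss(0, 1) for _ in range(5)] for _ in range(20)]
         + [[toy_rng.gauss(3, 1) for _ in range(5)] for _ in range(20)])
y_toy = [0] * 20 + [1] * 20
print(holdout_trials(X_toy, y_toy))
```

To evaluate a different classifier under the same protocol, only `fit_centroids` and `predict` would need to be swapped out; the splitting and scoring code is independent of the model.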