Li Haifeng, Zhang Keshu, Jiang Tao
Dept. of Computer Science, University of California at Riverside, Riverside, CA 92521, USA.
Proc IEEE Comput Syst Bioinform Conf. 2005:310-21. doi: 10.1109/csb.2005.49.
Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable precise and systematic tumor diagnosis. However, classification in this setting is very challenging because of the curse of dimensionality and the small sample size problem. In this paper, we propose a novel method that addresses both problems. Our method maps gene expression data into a very low-dimensional space and thus meets the recommended ratio of samples to features per class. As a result, it can classify new samples robustly, with low and reliable (estimated) error rates. The method is based on linear discriminant analysis (LDA). Conventional LDA, however, requires the within-class scatter matrix S_w to be nonsingular. Unfortunately, S_w is always singular in cancer classification because of the small sample size problem. To overcome this, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution for optimizing Fisher's criterion. GLDA is mathematically well-founded and coincides with conventional LDA when S_w is nonsingular. Unlike conventional LDA, GLDA does not assume that S_w is nonsingular, and thus naturally solves the small sample size problem. To accommodate the high dimensionality of the scatter matrices, we also develop a fast algorithm for GLDA. Our extensive experiments on seven public cancer datasets show that the method performs well. In particular, on some difficult instances with very small ratios of samples to genes per class, our method achieves much higher accuracy than widely used classification methods such as support vector machines and random forests.
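The singularity of S_w described in the abstract can be illustrated with a small sketch. The abstract does not give the GLDA algorithm, so the code below is not the authors' method: it uses the Moore-Penrose pseudoinverse of S_w, a common workaround, purely as a stand-in. The function names (`scatter_matrices`, `pinv_lda`), the synthetic data, and the sizes (5 samples per class, 50 features, 3 classes) are all assumptions chosen for illustration.

```python
import numpy as np


def scatter_matrices(X, y):
    """Return the within-class (S_w) and between-class (S_b) scatter matrices.

    X is an (n_samples, n_features) array; y holds integer class labels.
    """
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - mean_all)[:, None]
        S_b += Xc.shape[0] * (diff @ diff.T)
    return S_w, S_b


def pinv_lda(X, y):
    """Pseudoinverse LDA (hypothetical helper, NOT the paper's GLDA).

    Projects onto the leading (n_classes - 1) eigenvectors of pinv(S_w) @ S_b,
    which stays defined even when S_w is singular.
    """
    S_w, S_b = scatter_matrices(X, y)
    k = len(np.unique(y)) - 1
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(-eigvals.real)[:k]
    return eigvecs[:, order].real


# Synthetic stand-in for expression data: far more features than samples.
rng = np.random.default_rng(0)
n_per_class, d = 5, 50
X = np.vstack([rng.normal(loc=m, size=(n_per_class, d)) for m in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], n_per_class)

S_w, _ = scatter_matrices(X, y)
rank_S_w = np.linalg.matrix_rank(S_w)  # at most n_samples - n_classes

W = pinv_lda(X, y)  # projection matrix, shape (d, n_classes - 1)
Z = X @ W           # samples mapped into the low-dimensional space
```

With 15 samples in 3 classes, the rank of S_w cannot exceed 12, so the 50 x 50 matrix has no inverse and conventional LDA cannot be applied directly. The projection maps each sample to 2 dimensions, in the spirit of the "very low-dimensional space" mentioned in the abstract.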