Department of Statistics, University of California, 8125 Math Sciences Building, Box 951554, Los Angeles, CA 90095-1554, USA.
Proc Natl Acad Sci U S A. 2010 Apr 13;107(15):6737-42. doi: 10.1073/pnas.0910140107. Epub 2010 Mar 25.
To many biomedical researchers, effective tumor classification methods such as the support vector machine often appear to be a black box, not only because the procedures are complex but also because the required specifications, such as the choice of a kernel function, lack clear guidance, either mathematical or biological. As commonly observed, samples within the same tumor class tend to be more similar in gene expression than samples from different tumor classes. But can this widely held observation lead to a useful procedure for classification and prediction? To address this question, we first conceived a statistical framework and derived general conditions that serve as the theoretical foundation for the aforementioned empirical observation. We then constructed a classification procedure that fully utilizes the information obtained by comparing the distributions of within-class correlations with those of between-class correlations via Kullback-Leibler divergence. We compared our approach with many machine-learning techniques by applying them to 22 binary- and multiclass gene-expression datasets involving human cancers. The results showed that our method performed as efficiently as the support vector machine and naïve Bayes and outperformed other learning methods (decision trees, linear discriminant analysis, and k-nearest neighbors). In addition, we conducted a simulation study showing that our method is more effective when newly arriving samples are subject to the often-encountered problems of baseline shift or increased noise. Our method can be extended to general classification problems in which only similarity scores between samples are available.
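The core quantity the abstract describes — the contrast between the distribution of within-class correlations and that of between-class correlations, measured by Kullback-Leibler divergence — can be illustrated with a minimal sketch. This is not the authors' actual procedure; it only assumes pairwise Pearson correlation as the similarity score and a simple histogram-based KL estimate, with toy data in which samples of a class share a class-specific expression signal:

```python
import numpy as np

def pairwise_correlations(X):
    """All pairwise Pearson correlations between rows (samples) of X."""
    C = np.corrcoef(X)                  # sample-by-sample correlation matrix
    iu = np.triu_indices_from(C, k=1)   # upper triangle, excluding the diagonal
    return C[iu]

def kl_divergence_hist(p_samples, q_samples, bins=20):
    """Estimate D(P || Q) from shared-bin histograms of two empirical samples,
    with a small smoothing constant to avoid log(0)."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(p_samples, bins=edges)
    q, _ = np.histogram(q_samples, bins=edges)
    p = (p + 1e-6) / (p + 1e-6).sum()
    q = (q + 1e-6) / (q + 1e-6).sum()
    return float(np.sum(p * np.log(p / q)))

# Toy data: two classes of 10 samples over 50 genes; samples within a class
# share a class-specific signal, so their mutual correlations tend to be high.
rng = np.random.default_rng(0)
signal_a, signal_b = rng.normal(size=50), rng.normal(size=50)
class_a = signal_a + 0.5 * rng.normal(size=(10, 50))
class_b = signal_b + 0.5 * rng.normal(size=(10, 50))

within = np.concatenate([pairwise_correlations(class_a),
                         pairwise_correlations(class_b)])
# Cross-class correlations: upper-right block of the stacked correlation matrix.
between = np.corrcoef(class_a, class_b)[:10, 10:].ravel()

print(within.mean(), between.mean())          # within-class correlations are higher
print(kl_divergence_hist(within, between))    # positive: the two distributions differ
```

A classifier built on this idea would assign a new sample to whichever class makes the resulting correlation distributions most consistent with the within/between pattern — which is why, as the abstract notes, the approach needs only similarity scores between samples, not the raw feature vectors.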