Biometry Research Group, Division of Cancer Prevention, National Cancer Institute, EPN 3131, 6130 Executive Blvd MSC 7354, Bethesda, MD 20892-7354, USA.
BMC Bioinformatics. 2010 Sep 8;11:452. doi: 10.1186/1471-2105-11-452.
A simple classification rule with few genes and parameters is desirable when the rule is to be applied to new data. One popular simple rule, diagonal discriminant analysis, yields linear or curved classification boundaries, called Ripples, that are optimal when gene expression levels are normally distributed with the appropriate variance but may classify poorly in other situations.
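As a rough illustration of the diagonal discriminant analysis described above, the following minimal sketch fits a two-class rule that treats genes as independent normal variables. It is not the authors' software: the function names `fit_diagonal_da` and `predict_diagonal_da`, the 0/1 class coding, and the equal class priors are all assumptions made for this example. With `pooled=True` each gene shares one variance across classes, which gives a linear boundary; with `pooled=False` each class has its own variance, which gives a curved boundary.

```python
import numpy as np


def fit_diagonal_da(X, y, pooled=True):
    """Estimate per-gene means and variances for classes coded 0 and 1.

    Hypothetical helper for illustration; assumes y contains only 0 and 1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    # Row k holds the per-gene mean of class k.
    means = np.stack([X[y == k].mean(axis=0) for k in (0, 1)])
    if pooled:
        # One variance per gene shared by both classes (linear boundary).
        resid = X - means[y]
        var = (resid ** 2).sum(axis=0) / (len(y) - 2)
        variances = np.stack([var, var])
    else:
        # Separate variance per gene and class (curved boundary).
        variances = np.stack([X[y == k].var(axis=0, ddof=1) for k in (0, 1)])
    return means, variances


def predict_diagonal_da(X, means, variances):
    """Assign each sample to the class with the smaller score.

    The score is the negative log-likelihood under independent normals,
    up to constants, assuming equal class priors.
    """
    X = np.asarray(X, dtype=float)
    scores = np.stack([
        (((X - means[k]) ** 2) / variances[k] + np.log(variances[k])).sum(axis=1)
        for k in (0, 1)
    ])
    return scores.argmin(axis=0)
```

Pooling halves the number of variance parameters, which is the parsimony trade-off the abstract refers to.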
A simple modification of diagonal discriminant analysis yields smooth, highly nonlinear classification boundaries, called Swirls, that sometimes outperform Ripples. In particular, if the data are normally distributed with different variances in each class, Swirls substantially outperforms Ripples when a pooled variance is used to reduce the number of parameters. The proposed classification rule for two classes selects either Swirls or Ripples after parsimoniously selecting the number of genes and distance measures. Applications to five cancer microarray data sets identified predictive genes related to the tissue organization theory of carcinogenesis.
The parsimonious selection of classifiers coupled with the selection of either Swirls or Ripples provides a good basis for formulating a simple, yet flexible, classification rule. Open source software is available for download.