Chen Pei-Chun, Huang Su-Yun, Chen Wei J, Hsiao Chuhsing K
Bioinformatics and Biostatistics Core Laboratory, National Taiwan University, Taipei, Taiwan, Republic of China.
BMC Bioinformatics. 2009 Feb 3;10:44. doi: 10.1186/1471-2105-10-44.
Selecting influential genes from microarray data is often hampered by a large number of genes and a relatively small number of subjects. In addition to the curse of dimensionality, many gene selection methods weight the contribution from each individual subject equally. This equal-contribution assumption cannot account for possible dependence among subjects who are similarly associated with the disease, and it may restrict the selection of influential genes.
We propose a novel approach to gene selection based on kernel similarities and kernel weights. We do not assume that subjects contribute uniformly. Weights are calculated via regularized least squares support vector regression (RLS-SVR) of class levels on kernel similarities and are used to weight each subject's contribution. The cumulative sums of weighted expression levels are then ranked to select responsible genes. These procedures also work for multiclass classification. We demonstrate this algorithm on acute leukemia, colon cancer, small round blue cell tumors of childhood, breast cancer, and lung cancer studies, using kernel Fisher discriminant analysis and support vector machines as classifiers. Other procedures are compared as well.
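The weighting and ranking steps described above might be sketched as follows for the binary case. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the default bandwidth `gamma`, the penalty `ridge`, the +1/-1 coding of class levels, the omission of an intercept (class levels are centered instead), and the use of absolute values when ranking are all choices made here for illustration. The function names `gaussian_kernel` and `rank_genes` are hypothetical.

```python
import numpy as np


def gaussian_kernel(X, gamma):
    """Gaussian kernel similarities between subjects (rows of X)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def rank_genes(X, y, gamma=None, ridge=1.0):
    """Rank genes by the magnitude of their subject-weighted expression sums.

    X: (n_subjects, n_genes) expression matrix.
    y: class levels, assumed here to be coded +1 / -1.
    Returns gene indices from highest to lowest score, the gene scores,
    and the subject weights.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_subjects, n_genes = X.shape
    if gamma is None:
        gamma = 1.0 / n_genes  # assumed default bandwidth, not from the paper

    K = gaussian_kernel(X, gamma)
    # Regularized least squares fit of centered class levels on the kernel
    # similarities: solve (K + ridge * I) w = y - mean(y) for the weights w.
    w = np.linalg.solve(K + ridge * np.eye(n_subjects), y - y.mean())

    # Weighted sum of expression levels over subjects, one value per gene.
    scores = np.abs(X.T @ w)
    order = np.argsort(-scores)
    return order, scores, w


if __name__ == "__main__":
    # Synthetic data for illustration only: 20 subjects, 50 genes,
    # with gene 3 shifted between the two classes.
    rng = np.random.default_rng(0)
    y = np.array([1.0] * 10 + [-1.0] * 10)
    X = rng.standard_normal((20, 50))
    X[:, 3] += 3.0 * y
    order, scores, w = rank_genes(X, y)
    print("top five genes:", order[:5])
```

Because the weights come from a single linear solve on an n-by-n kernel matrix, the cost is driven by the number of subjects rather than the number of genes, which is consistent with the computational speed claimed for the approach. A multiclass variant would need a coding scheme for the class levels that this sketch does not attempt.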
This approach is easy to implement and computationally fast for both binary and multiclass problems. The gene set provided by the RLS-SVR weight-based approach contains fewer genes and achieves higher accuracy than those of the other procedures.