Chen Zhenyu, Li Jianping, Wei Liwei
Institute of Policy & Management, Chinese Academy of Sciences, Beijing 100080, China.
Artif Intell Med. 2007 Oct;41(2):161-75. doi: 10.1016/j.artmed.2007.07.008. Epub 2007 Sep 11.
Gene expression profiling with microarray techniques has recently emerged as a promising tool for improving the diagnosis and treatment of cancer. Gene expression data are highly noisy, and the number of genes far exceeds the number of available samples, which poses a great challenge for machine learning and statistical techniques. Support vector machines (SVMs) have been used successfully to classify gene expression data from cancer tissue. In the medical field, however, it is crucial to give the user a transparent decision process, and explaining the computed solutions and presenting the extracted knowledge remain a main obstacle for SVMs.
A multiple kernel support vector machine (MK-SVM) scheme, consisting of feature selection, rule extraction and prediction modeling, is proposed to improve the explanatory capacity of SVM. In this scheme, we show that the feature selection problem can be translated into an ordinary multiple-parameter learning problem. A shrinkage approach, 1-norm based linear programming, is proposed to obtain sparse parameters and the corresponding selected features. We also propose a novel rule extraction approach that uses the information provided by the separating hyperplane and the support vectors to improve the generalization capacity and comprehensibility of the rules and to reduce the computational complexity.
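As a rough illustration of how feature selection can become a parameter learning problem, the sketch below assumes one basis kernel per gene, combined with non-negative weights `beta`. The abstract does not spell out this construction, so the per-feature Gaussian kernel, the names `rbf_1d`, `composite_kernel` and `selected_features`, and the example weights are all hypothetical. The 1-norm linear program that would actually learn `beta` is not implemented here; the weights are simply given.

```python
import math


def rbf_1d(a, b, gamma=1.0):
    """Gaussian kernel on a single feature (one gene's expression value)."""
    return math.exp(-gamma * (a - b) ** 2)


def composite_kernel(x, z, beta, gamma=1.0):
    """Weighted sum of per-feature kernels: sum_d beta[d] * k(x[d], z[d]).

    Assumption: one basis kernel per feature, so a zero weight removes
    that feature from every kernel evaluation.
    """
    return sum(b * rbf_1d(xd, zd, gamma) for b, xd, zd in zip(beta, x, z))


def selected_features(beta, tol=1e-8):
    """Indices of features whose kernel weight is non-zero."""
    return [d for d, b in enumerate(beta) if abs(b) > tol]


# Hypothetical sparse weights, standing in for the output of a
# 1-norm shrinkage step (not computed in this sketch).
beta = [0.7, 0.0, 0.3]

x = [1.0, 5.0, 2.0]
z = [1.5, -3.0, 2.0]
z_changed = [1.5, 40.0, 2.0]  # differs from z only in feature 1

print(selected_features(beta))
# Feature 1 has zero weight, so changing it leaves the kernel value unchanged.
print(composite_kernel(x, z, beta), composite_kernel(x, z_changed, beta))
```

Under this assumption, a sparse `beta` doubles as a feature selector: any gene whose weight shrinks to zero no longer influences the decision function, which is why a sparsity-inducing 1-norm penalty on the weights would select features.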
Two public gene expression datasets, a leukemia dataset and a colon tumor dataset, are used to demonstrate the performance of this approach. Using a small number of selected genes, MK-SVM achieves encouraging classification accuracy: more than 90% on both datasets. Moreover, very simple rules with linguistic labels are extracted. The rule sets have high diagnostic power because of their good classification performance.