Analyzing kernel matrices for the identification of differentially expressed genes.

Author information

Xia Xiao-Lei, Xing Huanlai, Liu Xueqin

Affiliations

School of Mechanical and Electrical Engineering, Jiaxing University, Jiaxing, P.R. China.

School of Information Science and Technology, Southwest Jiaotong University, Chengdu, P.R. China.

Publication information

PLoS One. 2013 Dec 9;8(12):e81683. doi: 10.1371/journal.pone.0081683. eCollection 2013.

DOI:10.1371/journal.pone.0081683

PMID:24349110

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3857896/

Abstract

One of the most important applications of microarray data is the class prediction of biological samples. For this purpose, statistical tests are often applied to identify differentially expressed genes (DEGs), followed by state-of-the-art learning machines, the Support Vector Machine (SVM) in particular. The SVM is a typical sample-based classifier whose performance depends on how discriminant the samples are. However, DEGs identified by statistical tests are not guaranteed to yield a training dataset composed of discriminant samples. To tackle this problem, a novel gene ranking method, Kernel Matrix Gene Selection (KMGS), is proposed. The rationale of the method, which is rooted in the fundamental ideas of the SVM algorithm, is described. The notion of "the separability of a sample", estimated by computing [Formula: see text]-like statistics on each column of the kernel matrix, is first introduced. The separability of a classification problem is then measured, from which the significance of a specific gene is deduced. Also described is Kernel Matrix Sequential Forward Selection (KMSFS), a method that shares the essential ideas of KMGS but proceeds in a greedy manner. On three public microarray datasets, the proposed algorithms achieved competitive performance in terms of the B.632+ error rate.
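The abstract does not give the formulas behind KMGS or KMSFS, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes an RBF kernel, a Welch-style t-like statistic comparing same-class and other-class entries in each kernel-matrix column, a problem-level separability taken as the mean over samples, and a per-gene score computed from that gene alone. All function names, the `gamma` parameter, and the toy data are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) for the rows of X."""
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def sample_separability(K, y, j):
    """Assumed t-like score for column j: same-class vs other-class kernel values."""
    mask = np.ones(len(y), dtype=bool)
    mask[j] = False  # skip the sample's kernel value with itself
    same = K[mask & (y == y[j]), j]
    other = K[mask & (y != y[j]), j]
    # Welch-style standard error; the small constant guards against zero variance
    se = np.sqrt(same.var(ddof=1) / len(same) + other.var(ddof=1) / len(other)) + 1e-12
    return (same.mean() - other.mean()) / se

def problem_separability(X, y, gamma=1.0):
    """Assumed problem-level separability: mean of the per-sample scores."""
    K = rbf_kernel(X, gamma)
    return float(np.mean([sample_separability(K, y, j) for j in range(len(y))]))

def rank_genes(X, y, gamma=1.0):
    """Rank genes (columns of X) by the separability each achieves on its own."""
    scores = np.array([problem_separability(X[:, [g]], y, gamma)
                       for g in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

def forward_select(X, y, k, gamma=1.0):
    """Greedy forward selection: add the gene that most improves separability."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda g: problem_separability(X[:, selected + [g]], y, gamma))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (made up): gene 0 separates the classes, gene 1 does not.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.array([[0.0, 0.0], [0.2, 3.0], [0.4, 0.2], [0.6, 3.2],
              [3.0, 0.4], [3.2, 3.4], [3.4, 0.6], [3.6, 3.6]])
order, scores = rank_genes(X, y)
first_pick = forward_select(X, y, k=1)
```

In this toy setup the class-separating gene is ranked first and is also the first gene chosen by the greedy procedure. A real analysis would need the statistic and kernel defined in the paper, and each class must contain at least three samples for the variance terms in this sketch to be defined.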
