Shahjaman Md, Kumar Nishith, Ahmed Md Shakil, Begum AnjumanAra, Islam S M Shahinul, Mollah Md Nurul Haque
Bioinformatics Lab, Department of Statistics, University of Rajshahi-6205, Bangladesh.
Department of Statistics, Begum Rokeya University, Rangpur-5400, Bangladesh.
Bioinformation. 2017 Oct 31;13(10):327-332. doi: 10.6026/97320630013327. eCollection 2017.
Patient classification through feature selection (FS) based on gene expression data (GED) has become popular in the research community. The t-test is a well-known statistical FS method in GED analysis. However, it produces more false positives and lower accuracy when sample sizes are small or outliers are present. To overcome the shortcomings of the t-test with small sample sizes, significance analysis of microarrays (SAM) has been applied to GED, but SAM is highly sensitive to outliers. Recently, robust SAM, which uses minimum β-divergence estimators, has overcome these shortcomings of the classical t-test and SAM, and it has been successfully applied to the identification of differentially expressed (DE) genes. It has not, however, been applied to classification. In this paper, we therefore employ robust SAM as a feature selection approach together with classifiers for patient classification. We compare the performance of robust SAM with that of the classical t-test and SAM, using four popular classifiers (LDA, KNN, SVM and naive Bayes) on both simulated and real gene expression datasets. The results from the simulation and real data analyses show that all four classifiers perform better with robust SAM than with the classical t-test or SAM. In a real colon cancer dataset, robust SAM identified 21 additional DE genes that were not identified by the classical t-test or SAM. To reveal the biological functions and pathways of these 21 genes, we performed KEGG pathway enrichment analysis and found that these genes are involved in several important cancer-related pathways.
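The feature-selection-then-classification workflow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it ranks genes by the classical two-sample t-statistic or by a SAM-style statistic (the same numerator with a fudge constant `s0` added to the pooled standard error), keeps the top-ranked genes, and classifies a new sample with a small k-nearest-neighbour vote. The function names, the toy data, and the value `s0 = 0.1` are assumptions made for illustration. The robust SAM variant is not implemented here, because the abstract does not give the formulas of the minimum β-divergence estimators.

```python
import math
from statistics import mean


def pooled_se(x, y):
    """Pooled standard error of the difference in group means."""
    n1, n2 = len(x), len(y)
    mx, my = mean(x), mean(y)
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    return math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))


def t_stat(x, y):
    """Classical equal-variance t-statistic (undefined if both groups are constant)."""
    return (mean(x) - mean(y)) / pooled_se(x, y)


def sam_stat(x, y, s0):
    """SAM-style statistic: s0 > 0 damps genes with very small variance."""
    return (mean(x) - mean(y)) / (pooled_se(x, y) + s0)


def top_genes(data, labels, k, stat):
    """Return the indices of the k genes with the largest |stat|.

    data   : list of samples, each a list of expression values (one per gene)
    labels : list of 0/1 class labels, one per sample
    stat   : function (x, y) -> float computed per gene
    """
    scores = []
    for g in range(len(data[0])):
        x = [row[g] for row, lab in zip(data, labels) if lab == 0]
        y = [row[g] for row, lab in zip(data, labels) if lab == 1]
        scores.append((abs(stat(x, y)), g))
    scores.sort(reverse=True)
    return sorted(g for _, g in scores[:k])


def knn_predict(train, labels, sample, genes, k=3):
    """Majority vote among the k nearest training samples, using selected genes only."""
    dists = sorted(
        (math.dist([row[g] for g in genes], [sample[g] for g in genes]), lab)
        for row, lab in zip(train, labels)
    )
    votes = [lab for _, lab in dists[:k]]
    return max(set(votes), key=votes.count)


# Hypothetical toy data: 6 samples x 3 genes; only gene 0 separates the classes.
train = [
    [1.0, 3.0, 2.0],
    [1.2, 2.0, 2.5],
    [0.8, 2.5, 1.5],
    [5.0, 2.4, 2.1],
    [5.2, 3.1, 1.9],
    [4.8, 2.0, 2.3],
]
labels = [0, 0, 0, 1, 1, 1]

selected_t = top_genes(train, labels, k=1, stat=t_stat)
selected_sam = top_genes(train, labels, k=1, stat=lambda x, y: sam_stat(x, y, s0=0.1))
prediction = knn_predict(train, labels, [4.9, 2.5, 2.0], selected_sam, k=3)
```

Swapping the `stat` argument is the only change needed to compare selection methods under the same classifier, which mirrors the comparison design in the abstract. A robust statistic could be passed in the same way once its estimators are specified.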