Turfan Derya, Altunkaynak Bulent, Yeniay Özgür
Department of Statistics, Hacettepe University, Ankara, Turkey.
Department of Statistics, Gazi University, Ankara, Turkey.
Big Data. 2024 Aug;12(4):312-330. doi: 10.1089/big.2022.0086. Epub 2023 Sep 4.
Over the years, many studies have been carried out to reduce and eliminate the effects of diseases on human health. Gene expression data sets play a critical role in diagnosing and treating diseases. These data sets consist of thousands of genes but only a small number of samples. This imbalance gives rise to the curse of dimensionality and makes such data sets difficult to analyze. Feature selection is one of the most effective strategies for addressing this problem. It is a preprocessing step that improves classification accuracy by selecting the most relevant and informative features. In this article, we propose a new statistically based filter method for feature selection, named the Effective Range-based Feature Selection Algorithm (FSAER). As an extension of the earlier Effective Range based Gene Selection (ERGS) and Improved Feature Selection based on Effective Range (IFSER) algorithms, our method combines the advantages of both while also taking the disjoint area into account. To illustrate the efficacy of the proposed algorithm, experiments were conducted on six benchmark gene expression data sets. FSAER was compared with other filter methods in terms of classification accuracy, using support vector machines, the naive Bayes classifier, and the k-nearest neighbor algorithm as classifiers.
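The abstract does not give the FSAER scoring formula, so the following is only a minimal sketch of the general idea behind effective-range filter methods, not the authors' algorithm. It assumes an ERGS-style effective range for each class (class mean plus or minus a multiple of the class standard deviation, scaled by one minus the class prior) and scores a feature by how little those ranges overlap. The multiplier `GAMMA`, the function names, and the scoring rule are illustrative assumptions, and the disjoint-area term that FSAER adds is not modeled here.

```python
from statistics import mean, pstdev

GAMMA = 1.732  # assumed range multiplier; not taken from the abstract


def effective_ranges(values, labels, gamma=GAMMA):
    """Return one (lower, upper) range per class for a single feature.

    Assumed ERGS-style definition: mean +/- (1 - prior) * gamma * std.
    """
    n = len(labels)
    ranges = []
    for c in sorted(set(labels)):
        xs = [v for v, y in zip(values, labels) if y == c]
        prior = len(xs) / n
        half_width = (1 - prior) * gamma * pstdev(xs)
        center = mean(xs)
        ranges.append((center - half_width, center + half_width))
    return ranges


def overlap_ratio(ranges):
    """Sum of pairwise range overlaps divided by the total span covered."""
    ranges = sorted(ranges)  # sort by lower bound
    overlap = 0.0
    for a in range(len(ranges)):
        for b in range(a + 1, len(ranges)):
            # ranges[b] starts at or after ranges[a], so this is their overlap
            overlap += max(0.0, min(ranges[a][1], ranges[b][1]) - ranges[b][0])
    span = max(hi for _, hi in ranges) - min(lo for lo, _ in ranges)
    # A zero span means all ranges collapse to one point: treat as uninformative
    return overlap / span if span > 0 else 1.0


def rank_features(X, y):
    """Rank feature indices from least to most class-range overlap.

    X is a list of samples, each a list of feature values; y holds class labels.
    """
    n_features = len(X[0])
    scores = []
    for i in range(n_features):
        column = [row[i] for row in X]
        scores.append(1.0 - overlap_ratio(effective_ranges(column, y)))
    return sorted(range(n_features), key=lambda i: scores[i], reverse=True)


# Toy data: feature 0 separates the two classes, feature 1 does not
X = [[1.0, 5.0], [1.2, 1.0], [0.8, 3.0],
     [5.0, 5.1], [5.2, 0.9], [4.8, 3.1]]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_features(X, y)
```

In a pipeline like the one the abstract describes, the top-ranked features would then be passed to a classifier such as a support vector machine, naive Bayes, or k-nearest neighbor, and the resulting accuracy compared across filter methods. With more than two classes the summed pairwise overlap can exceed the span, so the score in this sketch is useful for ranking only and is not bounded to the interval from 0 to 1.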