Hussain Ibrar, Qureshi Moiz, Ismail Muhammad, Iftikhar Hasnain, Zywiołek Justyna, López-Gonzales Javier Linkolk
Department of Statistics, Abdul Wali Khan University Mardan, Pakistan.
Govt Boys Degree College Tandojam, Hyderabad, Sindh, Pakistan.
Heliyon. 2024 Sep 2;10(17):e37241. doi: 10.1016/j.heliyon.2024.e37241. eCollection 2024 Sep 15.
Bioinformatics and gene expression analysis face major hurdles when dealing with high-dimensional data, where the number of variables (genes) far exceeds the number of samples. These difficulties are exacerbated, particularly in microarray data processing, by redundant genes that do not contribute significantly to the response variable. Gene selection addresses this issue by identifying the most important genes, thereby reducing the generalization error of classification algorithms. This paper introduces a new hybrid approach to gene selection that combines the Signal-to-Noise Ratio (SNR) score with the robust Mood median test. The Mood median test limits the impact of outliers in non-normal or skewed data while still identifying genes that differ significantly across groups. The SNR score measures a gene's relevance for classification by comparing the gap between class means with the within-class variability. By integrating the two, the proposed approach aims to find genes that are significant for classification tasks. The major objective of this study is to evaluate the effectiveness of this combined approach in choosing the optimal genes. For each gene, the Mood median test yields a P-value and the SNR score is computed. The Md score is then obtained by dividing the gene's SNR value by its P-value. Genes with a high SNR are considered favorable because they are little affected by noise and carry high importance for classification. To verify the effectiveness of the selected genes, the study uses two dependable classification techniques: Random Forest and K-Nearest Neighbors (KNN). These algorithms were chosen for their track record on classification tasks. The performance of the selected genes is evaluated using two metrics: error reduction and classification accuracy. These metrics offer an in-depth assessment of how well the selected genes improve classification accuracy and consistency. According to the findings, the hybrid approach proposed here outperforms conventional gene selection methods on high-dimensional datasets and yields lower classification error rates. Classification accuracy and error reduction improve considerably when the selected genes are supplied to the Random Forest and KNN classifiers. The outcomes suggest that this hybrid technique may be a helpful tool for improving gene selection in bioinformatics.
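The Md score described above (SNR divided by the Mood median test P-value) can be illustrated with a minimal two-class sketch. The abstract does not give the exact formulas, so several choices below are assumptions rather than the authors' implementation: the SNR is taken in the commonly used form |mean_a - mean_b| / (sd_a + sd_b), the Mood median test is computed as a 2x2 chi-square test without continuity correction, `p_floor` is a hypothetical guard against division by zero, and the expression values are made up for illustration.

```python
import math
from statistics import mean, median, stdev


def snr_score(a, b):
    """Absolute SNR, assumed here as |mean_a - mean_b| / (sd_a + sd_b)."""
    return abs(mean(a) - mean(b)) / (stdev(a) + stdev(b))


def mood_median_p(a, b):
    """P-value of a two-sample Mood median test (no continuity correction)."""
    grand = median(list(a) + list(b))
    # 2x2 table: rows = above / not above the grand median, columns = class.
    above = [sum(x > grand for x in g) for g in (a, b)]
    below = [len(g) - k for g, k in zip((a, b), above)]
    n = len(a) + len(b)
    chi2 = 0.0
    for row in (above, below):
        for observed, col_total in zip(row, (len(a), len(b))):
            expected = sum(row) * col_total / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def md_score(a, b, p_floor=1e-300):
    """Md = SNR / p; p_floor is a hypothetical guard against p == 0."""
    return snr_score(a, b) / max(mood_median_p(a, b), p_floor)


# Toy expression values (invented for illustration), keyed by gene name.
class_a = {"gene_1": [1.0, 1.2, 0.9, 1.1], "gene_2": [1.0, 3.0, 2.0, 4.0]}
class_b = {"gene_1": [3.0, 3.2, 2.9, 3.1], "gene_2": [1.5, 3.5, 2.5, 3.8]}

# Score every gene, then rank from highest to lowest Md.
scores = {g: md_score(class_a[g], class_b[g]) for g in class_a}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy data, `gene_1` separates the two classes cleanly, so it combines a large SNR with a small P-value and ranks first. `gene_2` has overlapping classes, giving a P-value of 1 and a low Md score. In practice, the top-ranked genes would then be passed to a classifier such as Random Forest or KNN, as the study describes.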