Nair Ajin R, Rajaguru Harikumar, Karthika M S, Keerthivasan C
Department of Electronics and Communication Engineering, Bannari Amman Institute of Technology, Sathyamangalam, India.
Bannari Amman Institute of Technology, Sathyamangalam, India.
Sci Rep. 2024 Jul 17;14(1):16485. doi: 10.1038/s41598-024-67135-1.
Microarray gene expression data pose a substantial challenge because of the curse of dimensionality: the number of features far exceeds the number of available samples, which leads to overfitting and reduced classification accuracy. The dimensionality of such data must therefore be reduced with efficient feature extraction methods that shrink the data volume while retaining meaningful information, improving both classification accuracy and interpretability. In this research, we explore the use of STFT (Short-Time Fourier Transform), LASSO (Least Absolute Shrinkage and Selection Operator), and EHO (Elephant Herding Optimization) to extract significant features from lung cancer microarray gene expression data and to reduce its dimensionality. Lung cancer is then classified using the following classifiers: Gaussian Mixture Model (GMM), Particle Swarm Optimization (PSO) with GMM, Detrended Fluctuation Analysis (DFA), Naive Bayes classifier (NBC), Firefly with GMM, Support Vector Machine with Radial Basis Function kernel (SVM-RBF), and Flower Pollination Optimization (FPO) with GMM. EHO feature extraction combined with the FPO-GMM classifier attained the highest accuracy, 96.77%, with an F1 score of 97.5, a Matthews correlation coefficient (MCC) of 0.92, and a Kappa of 0.92. The reported results underline the value of STFT, LASSO, and EHO as feature extraction methods for reducing the dimensionality of microarray gene expression data. These methods may also support improved and earlier diagnosis of lung cancer through enhanced classification accuracy and interpretability.
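The abstract reports four performance measures (accuracy, F1 score, MCC, and Kappa) without stating how they are computed. As a minimal sketch, assuming a binary classification task and the standard textbook definitions, all four can be derived from the counts of a confusion matrix. The function name `binary_metrics` and the counts used below are hypothetical illustrations, not values or code from the study.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, F1, MCC and Cohen's kappa from binary confusion counts.

    tp, fp, fn, tn are hypothetical counts of true positives, false positives,
    false negatives and true negatives. They are not taken from the paper.
    """
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("confusion matrix is empty")

    # Fraction of all samples that were classified correctly.
    accuracy = (tp + tn) / n

    # F1 is the harmonic mean of precision and recall.
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc, "kappa": kappa}


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the calculation.
    for name, value in binary_metrics(tp=40, fp=10, fn=10, tn=40).items():
        print(f"{name}: {value:.3f}")
```

MCC and Kappa both correct for agreement expected by chance, which is why they are commonly reported alongside accuracy when the number of samples is small, as is typical for microarray data.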