Two-stage feature selection for classification of gene expression data based on an improved Salp Swarm Algorithm.

Author information

Qin Xiwen, Zhang Shuang, Yin Dongmei, Chen Dongxue, Dong Xiaogang

Affiliation

School of Mathematics and Statistics, Changchun University of Technology, Changchun 130012, China.

Publication information

Math Biosci Eng. 2022 Sep 19;19(12):13747-13781. doi: 10.3934/mbe.2022641.

DOI:10.3934/mbe.2022641

PMID:36654066

Abstract

Microarray technology has developed rapidly in recent years, producing large volumes of ultra-high-dimensional gene expression data. However, because of the extreme disproportion between sample size and dimensionality in such data, screening important genes from them is very challenging. For small-sample, high-dimensional biomedical data, this paper proposes a two-stage feature selection framework that combines wrapper, embedded and filter methods to avoid the curse of dimensionality. In the first stage, the framework uses weighted gene co-expression network analysis (WGCNA), random forest (RF) and minimal redundancy maximal relevance (mRMR) for feature selection. In the second stage, a new gene selection method based on an improved binary Salp Swarm Algorithm is proposed, which is combined with machine learning methods to adaptively select feature subsets suited to the classification algorithm. Finally, classification accuracy is evaluated using six methods: LightGBM, RF, SVM, XGBoost, MLP and KNN. To verify the performance of the framework and the effectiveness of the proposed algorithm, the number of selected genes and the classification accuracy were compared with those of five other intelligent optimization algorithms. The results show that the proposed framework achieves accuracy equal to or higher than that of other advanced intelligent algorithms on 10 datasets, exceeding 97.6% on all of them. This indicates that the proposed method can solve feature selection problems for high-dimensional data; the framework is not restricted to particular datasets and can be applied to other fields involving feature selection.
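
To make the second stage more concrete, the following is a minimal sketch of a wrapper-style gene selector driven by the basic binary Salp Swarm Algorithm. It is not the improved variant proposed in the paper, whose modifications the abstract does not describe, and it omits the first-stage WGCNA/RF/mRMR filtering. Everything named here is an assumption made for illustration: the functions `make_toy_data`, `loo_1nn_error`, `fitness` and `binary_ssa`, the sigmoid transfer function, the leave-one-out 1-nearest-neighbour wrapper, the weight `alpha=0.99`, the 0.5 threshold on the leader's direction, and the search bounds.

```python
import math
import random


def make_toy_data(n_samples=30, n_features=8, seed=0):
    """Hypothetical toy data: only feature 0 carries class information."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 4.0 * label  # shift the informative feature by class
        X.append(row)
        y.append(label)
    return X, y


def loo_1nn_error(X, y, mask):
    """Leave-one-out error of a 1-nearest-neighbour classifier on the selected features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 1.0  # an empty subset cannot classify anything
    errors = 0
    for i, xi in enumerate(X):
        best_dist, best_label = float("inf"), None
        for k, xk in enumerate(X):
            if k == i:
                continue
            dist = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        if best_label != y[i]:
            errors += 1
    return errors / len(X)


def fitness(X, y, mask, alpha=0.99):
    """Weighted sum of error rate and subset-size ratio; lower is better."""
    return alpha * loo_1nn_error(X, y, mask) + (1 - alpha) * sum(mask) / len(mask)


def binary_ssa(X, y, n_salps=10, n_iter=20, lb=-6.0, ub=6.0, seed=0):
    """Basic binary SSA: continuous positions, sigmoid transfer to 0/1 masks."""
    rng = random.Random(seed)
    dim = len(X[0])
    swarm = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n_salps)]
    best_mask, best_fit, food = None, float("inf"), None
    for it in range(1, n_iter + 1):
        # Evaluate each salp through a stochastic sigmoid binarisation.
        for pos in swarm:
            mask = [1 if rng.random() < 1 / (1 + math.exp(-p)) else 0 for p in pos]
            fit = fitness(X, y, mask)
            if fit < best_fit:
                best_fit, best_mask, food = fit, mask, pos[:]
        # c1 decays over iterations, shifting from exploration to exploitation.
        c1 = 2 * math.exp(-((4 * it / n_iter) ** 2))
        for i in range(n_salps):
            if i == 0:
                # Leader moves around the food source (best position so far).
                for j in range(dim):
                    step = c1 * ((ub - lb) * rng.random() + lb)
                    swarm[0][j] = food[j] + step if rng.random() >= 0.5 else food[j] - step
            else:
                # Followers move to the midpoint with the salp ahead of them.
                for j in range(dim):
                    swarm[i][j] = (swarm[i][j] + swarm[i - 1][j]) / 2
            swarm[i] = [min(ub, max(lb, v)) for v in swarm[i]]
    return best_mask, best_fit


if __name__ == "__main__":
    X, y = make_toy_data()
    mask, fit = binary_ssa(X, y)
    print("selected features:", [j for j, keep in enumerate(mask) if keep])
    print("fitness:", fit)
```

The fitness trades classification error against subset size, which is a common way to express the two quantities the abstract reports (accuracy and number of selected genes) as a single objective. In the framework described above, the 1-nearest-neighbour wrapper would be replaced by whichever of the six classifiers is being targeted.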
