Faculty of Computer Engineering, K. N. Toosi University of Technology, Tehran, Iran.
BMC Med Inform Decis Mak. 2021 Nov 27;21(1):333. doi: 10.1186/s12911-021-01696-3.
Gene expression data play an important role in bioinformatics applications. Although such data may contain a large number of features, they typically include only a few samples. This imbalance can negatively impact the performance of data mining and machine learning algorithms. One of the most effective approaches to alleviating this problem is gene selection. The aim of gene selection is to reduce the dimensionality (number of features) of gene expression data by eliminating irrelevant and redundant genes.
This paper presents a hybrid gene selection method based on graph theory and a many-objective particle swarm optimization (PSO) algorithm. First, a filter method is used to reduce the initial gene space. The reduced gene space is then represented as a graph, and a graph clustering method groups the genes into several clusters. Next, the many-objective PSO algorithm searches for an optimal subset of genes according to several criteria, including classification error, node centrality, specificity, edge centrality, and the number of selected genes. A repair operator is proposed to cover the whole gene space and ensure that at least one gene is selected from each cluster, which increases the diversity of the selected genes.
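The repair step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `repair`, the binary-mask encoding of a particle, and the uniformly random choice of the gene to add are all assumptions made for illustration, since the abstract only states that at least one gene must be selected from each cluster.

```python
import random


def repair(mask, clusters, rng=None):
    """Return a copy of `mask` in which every cluster has a selected gene.

    mask     -- list of 0/1 flags, one per gene (assumed particle encoding)
    clusters -- list of lists of gene indices, one list per cluster
    rng      -- optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    repaired = list(mask)  # leave the caller's mask untouched
    for members in clusters:
        # Skip empty clusters; otherwise check whether any member is selected.
        if members and not any(repaired[g] for g in members):
            # Assumption: pick the gene to add uniformly at random.
            repaired[rng.choice(members)] = 1
    return repaired


# Hypothetical example: 8 genes grouped into 3 clusters,
# with only gene 0 selected before the repair.
clusters = [[0, 1, 2], [3, 4], [5, 6, 7]]
mask = [1, 0, 0, 0, 0, 0, 0, 0]
repaired = repair(mask, clusters, random.Random(0))
```

After the call, each of the three clusters contributes at least one gene to `repaired`, while the original `mask` is left unchanged. A variant could instead pick the gene with the highest node centrality in an uncovered cluster; the abstract does not say which rule the authors use.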
To evaluate the performance of the proposed method, extensive experiments are conducted on seven datasets using two evaluation measures. In addition, three classifiers, namely Decision Tree (DT), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN), are used to compare the effectiveness of the proposed gene selection method with other state-of-the-art methods. The results of these experiments demonstrate that the proposed method not only achieves more accurate classification but also selects fewer genes than the other methods.
This study shows that the proposed many-objective PSO algorithm simultaneously removes irrelevant and redundant features using several different criteria. In addition, the clustering algorithm and the repair operator improve the performance of the proposed method by covering the whole space of the problem.