Department of Computer Science and Engineering, Green University of Bangladesh, Dhaka 1207, Bangladesh.
Department of Computer Science and Engineering, Jagannath University, Dhaka 1100, Bangladesh.
Genes (Basel). 2023 Sep 14;14(9):1802. doi: 10.3390/genes14091802.
Biomarker-based tools for cancer identification and classification are widely used in bioinformatics and machine learning. However, the high dimensionality of microarray gene expression data makes it difficult to identify the genes that matter for cancer diagnosis. Many feature selection algorithms address this by selecting a small set of optimal features. This article proposes an ensemble rank-based feature selection method (EFSM) and an ensemble weighted average voting classifier (VT) to overcome this challenge. The EFSM uses a ranking method that aggregates the features chosen by individual selection methods to efficiently find the most relevant and useful ones. The VT combines support vector machine, k-nearest neighbor, and decision tree algorithms into a single ensemble model. The proposed method was tested on three benchmark datasets and compared with existing built-in ensemble models. It achieved higher accuracy: 100% on the leukaemia dataset, 94.74% on the colon cancer dataset, and 94.34% on the 11-tumor dataset. The study concludes by identifying a subset of the most important cancer-causing genes and demonstrating their significance compared to the original data. The proposed approach surpasses existing strategies in accuracy and stability, detects vital genes with higher precision and stability than other existing methods, and has significant implications for the development of ML-based gene analysis.
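The two components described in the abstract can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation. The abstract does not name the individual selection methods, the aggregation rule, or the voting weights, so everything specific here is an assumption made for illustration: the three base selectors (ANOVA F-score, mutual information, random forest importance), aggregation by mean rank, the weights `(2, 1, 1)`, and the helper names `ensemble_rank_select` and `build_voting_classifier`.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def ensemble_rank_select(X, y, k, random_state=0):
    """Return the indices of the k features with the best mean rank.

    Hypothetical helper: the base selectors and the mean-rank
    aggregation are assumptions, not taken from the paper.
    """
    f_scores, _ = f_classif(X, y)
    mi_scores = mutual_info_classif(X, y, random_state=random_state)
    forest = RandomForestClassifier(n_estimators=200, random_state=random_state)
    rf_scores = forest.fit(X, y).feature_importances_

    # Convert each score vector to ranks, where rank 1 is the best feature.
    ranks = [
        rankdata(-np.nan_to_num(scores), method="average")
        for scores in (f_scores, mi_scores, rf_scores)
    ]
    mean_rank = np.mean(ranks, axis=0)
    return np.argsort(mean_rank, kind="stable")[:k]


def build_voting_classifier(weights=(2, 1, 1), random_state=0):
    """Weighted soft-voting ensemble of SVM, k-NN, and a decision tree.

    The weights are placeholders chosen for illustration only.
    """
    svm = make_pipeline(
        StandardScaler(),
        # probability=True is required so the SVM can take part in soft voting.
        SVC(kernel="linear", probability=True, random_state=random_state),
    )
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    tree = DecisionTreeClassifier(random_state=random_state)
    return VotingClassifier(
        estimators=[("svm", svm), ("knn", knn), ("dt", tree)],
        voting="soft",  # averages the class probabilities using the weights
        weights=list(weights),
    )
```

In use, the selection step should be fitted on the training split only, for example `selected = ensemble_rank_select(X_train, y_train, k=20)` followed by `build_voting_classifier().fit(X_train[:, selected], y_train)`, so that the held-out samples do not influence which genes are kept.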