IEEE/ACM Trans Comput Biol Bioinform. 2018 Nov-Dec;15(6):1765-1773. doi: 10.1109/TCBB.2016.2602263. Epub 2016 Aug 24.
High-dimensional biomedical datasets contain thousands of features that can be used in the molecular diagnosis of disease. However, such datasets also contain many irrelevant or weakly correlated features, which degrade the predictive accuracy of diagnosis. Without a feature selection algorithm, it is difficult for existing classification techniques to accurately identify patterns in the features. The purpose of feature selection is not only to identify a feature subset from the original set of features without reducing the predictive accuracy of the classification algorithm, but also to reduce the computational overhead of data mining. In this paper, we present an improved shuffled frog leaping algorithm that introduces a chaos memory weight factor, an absolute balance group strategy, and an adaptive transfer factor. The proposed approach explores the space of possible subsets to obtain the set of features that maximizes predictive accuracy and minimizes the number of irrelevant features in high-dimensional biomedical data. To evaluate its effectiveness, we employ the K-nearest neighbor method in a comparative analysis against genetic algorithms, particle swarm optimization, and the shuffled frog leaping algorithm. Experimental results show that the improved algorithm achieves better identification of relevant subsets and higher classification accuracy.
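The wrapper-style evaluation described above, in which a candidate feature subset is scored by the accuracy of a K-nearest neighbor classifier, can be illustrated with a minimal sketch. This is not the paper's fitness function, which the abstract does not specify: the leave-one-out scheme, the size penalty `alpha`, the default `k=3`, the function names, and the toy data are all assumptions made for illustration. A search procedure such as a shuffled frog leaping variant would call a scoring function of this kind on each candidate subset.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    # Sort training samples by Euclidean distance to x.
    dists = sorted((math.dist(row, x), label) for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def subset_fitness(X, y, mask, k=3, alpha=0.01):
    """Score a binary feature mask (hypothetical fitness, not the paper's).

    Returns leave-one-out KNN accuracy on the selected features minus a
    small penalty proportional to the fraction of features kept.
    """
    selected = [j for j, keep in enumerate(mask) if keep]
    if not selected:
        return 0.0  # an empty subset cannot classify anything
    reduced = [[row[j] for j in selected] for row in X]
    correct = 0
    for i in range(len(reduced)):
        # Hold out sample i and classify it using all remaining samples.
        train_X = reduced[:i] + reduced[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, reduced[i], k) == y[i]:
            correct += 1
    accuracy = correct / len(reduced)
    return accuracy - alpha * len(selected) / len(mask)


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, -4.0], [0.2, 3.0], [1.0, -5.0], [1.1, 4.0], [1.2, -3.0]]
y = [0, 0, 0, 1, 1, 1]

informative_only = subset_fitness(X, y, [1, 0])
all_features = subset_fitness(X, y, [1, 1])
print(informative_only, all_features)
```

On this toy data the mask that keeps only the informative feature scores higher than the mask that keeps both, which is the behavior a feature selection search relies on: dropping an irrelevant feature both raises accuracy and reduces the size penalty.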