Luo Chuan, Wang Sizhao, Li Tianrui, Chen Hongmei, Lv Jiancheng, Yi Zhang
IEEE Trans Neural Netw Learn Syst. 2023 Dec;34(12):10889-10903. doi: 10.1109/TNNLS.2022.3171614. Epub 2023 Nov 30.
Selecting prominent features to build more compact and efficient models is an important data preprocessing task in data mining. The rough hypercuboid approach is an emerging technique for eliminating irrelevant and redundant features, especially for handling inexactness in approximate numerical classification. This article proposes RH-BPSO, a novel global search method for numerical feature selection that hybridizes the rough hypercuboid approach with the meta-heuristic binary particle swarm optimization (BPSO) algorithm. To further alleviate the high computational cost of processing large-scale datasets, parallelization approaches for calculating the hybrid feature evaluation criteria are presented; they decompose and recombine the hypercuboid equivalence partition matrix via horizontal data partitioning. A distributed meta-heuristic optimized rough hypercuboid feature selection (DiRH-BPSO) algorithm is thus developed and embedded in the Apache Spark cloud computing model. Extensive experimental results indicate that RH-BPSO is promising and can significantly outperform other representative feature selection algorithms in terms of classification accuracy, the cardinality of the selected feature subset, and execution efficiency. Moreover, experiments on distributed-memory multicore clusters show that DiRH-BPSO is significantly faster than its sequential counterpart and can complete large-scale feature selection tasks that fail on a single node due to memory constraints. Parallel scalability and extensibility analyses also demonstrate that DiRH-BPSO scales out and extends well as the number of computational nodes and the volume of data grow.
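To make the search component concrete, the following is a minimal sketch of generic binary particle swarm optimization applied to feature-subset selection. It is not the authors' RH-BPSO. The function names (`bpso_select`, `toy_fitness`), the parameter defaults, and the toy fitness function are all assumptions made for illustration; in the method described above, the fitness would be the rough hypercuboid based hybrid evaluation criteria, which the abstract does not specify.

```python
import math
import random

def bpso_select(fitness, n_features, n_particles=20, n_iters=50,
                w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Generic BPSO: maximize fitness(mask) over 0/1 feature masks.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    # Each position is a bit mask: 1 keeps the feature, 0 drops it.
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the velocity so the sigmoid does not saturate.
                v = max(-v_max, min(v_max, v))
                vel[i][d] = v
                # Sigmoid transfer: velocity becomes the probability of bit = 1.
                prob_one = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

# Hypothetical stand-in fitness: reward two "relevant" features and
# penalize subset size. A real run would score the subset on data.
RELEVANT = (0, 3)

def toy_fitness(mask):
    return sum(mask[j] for j in RELEVANT) - 0.1 * sum(mask)

best_mask, best_score = bpso_select(toy_fitness, n_features=8)
print(best_mask, best_score)
```

The size penalty in the toy fitness mirrors the general trade-off the abstract reports between classification accuracy and the cardinality of the selected subset. In a distributed setting, the fitness evaluation is the step that would be computed over horizontal data partitions.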