Yan Xiaoran, Shang Shilong, Li Dongxi, Dang Yun
College of Artificial Intelligence, Taiyuan University of Technology, Taiyuan, Shanxi, China.
College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, Shanxi, China.
Sci Rep. 2025 Aug 17;15(1):30100. doi: 10.1038/s41598-025-15068-8.
Feature selection (FS) is especially important for high-dimensional data. In this paper, we propose an efficient and interactive feature selection approach based on copula entropy (CEFS+). The method combines feature-feature mutual information with feature-label mutual information and selects features greedily under a maximum-correlation, minimum-redundancy strategy. It uses copula entropy as the measure of feature relevance, which captures the full-order interaction gain between features. We prove the divisibility of multivariate mutual information, derive a novel feature selection criterion, and propose a copula-entropy-based feature selection approach called CEFS. To overcome the instability of CEFS on some datasets, we further propose an improved method, CEFS+, which is based on the rank technique. Finally, we evaluate the effectiveness of CEFS and CEFS+ using three classifiers on five datasets. In 10 of the 15 scenarios, our approach obtains the highest classification accuracy, which is much higher than that of the other six commonly used FS methods. In particular, our approach performs better on high-dimensional genetic datasets.
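As a rough illustration of the kind of greedy, relevance-minus-redundancy selection the abstract describes, the sketch below rank-transforms each variable (an empirical copula), estimates pairwise mutual information as the negative copula entropy with a simple histogram, and picks features one at a time. It is not the CEFS or CEFS+ criterion from the paper: the function names (`rank_transform`, `copula_mi`, `greedy_select`), the histogram estimator, the bin count, the scoring rule (relevance minus mean pairwise redundancy), and the synthetic data are all assumptions made for this example. It also assumes continuous variables without ties; discrete class labels would need separate handling.

```python
import numpy as np

def rank_transform(x):
    """Map a 1-D sample into (0, 1) using ranks / (n + 1) (empirical CDF)."""
    ranks = np.argsort(np.argsort(x)) + 1
    return ranks / (len(x) + 1.0)

def copula_mi(x, y, bins=8):
    """Histogram estimate of I(X;Y), i.e. the negative copula entropy of (X, Y)."""
    u, v = rank_transform(x), rank_transform(y)
    counts, _, _ = np.histogram2d(u, v, bins=bins, range=[[0, 1], [0, 1]])
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal of u, shape (bins, 1)
    py = p.sum(axis=0, keepdims=True)   # marginal of v, shape (1, bins)
    mask = p > 0                        # skip empty cells to avoid log(0)
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

def greedy_select(X, y, k, bins=8):
    """Greedily pick k columns of X: high MI with y, low mean MI with picked columns."""
    n_features = X.shape[1]
    relevance = np.array([copula_mi(X[:, j], y, bins) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([copula_mi(X[:, j], X[:, s], bins) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Hypothetical synthetic data: column 1 nearly duplicates column 0,
# column 2 carries independent signal, column 3 is pure noise.
rng = np.random.default_rng(0)
n = 2000
f0 = rng.normal(size=n)
f1 = f0 + 0.05 * rng.normal(size=n)
f2 = rng.normal(size=n)
f3 = rng.normal(size=n)
X = np.column_stack([f0, f1, f2, f3])
y = f0 + f2 + 0.1 * rng.normal(size=n)

selected = greedy_select(X, y, k=2)
print("selected feature indices:", selected)
```

In this toy setting the redundancy term is what keeps the near-duplicate pair (columns 0 and 1) from being chosen together; a relevance-only ranking would have no reason to prefer column 2 over the duplicate.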