Estrela Gustavo, Gubitoso Marco Dimas, Ferreira Carlos Eduardo, Barrera Junior, Reis Marcelo S
Center of Toxins, Immune-Response and Cell Signaling (CeTICS), Laboratório de Ciclo Celular, Instituto Butantan, Butantã, São Paulo-SP 05503-900, Brazil.
Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo-SP 05503-900, Brazil.
Entropy (Basel). 2020 Apr 24;22(4):492. doi: 10.3390/e22040492.
In Machine Learning, feature selection is an important step in classifier design. It consists of finding a subset of features that is optimum for a given cost function. One possibility to solve feature selection is to organize all possible feature subsets into a Boolean lattice and to exploit the fact that the costs of chains in that lattice describe U-shaped curves. Minimization of such cost function is known as the U-curve problem. Recently, a study proposed U-Curve Search (UCS), an optimal algorithm for that problem, which was successfully used for feature selection. However, despite of the algorithm optimality, the UCS required time in computational assays was exponential on the number of features. Here, we report that such scalability issue arises due to the fact that the U-curve problem is NP-hard. In the sequence, we introduce the Parallel U-Curve Search (PUCS), a new algorithm for the U-curve problem. In PUCS, we present a novel way to partition the search space into smaller Boolean lattices, thus rendering the algorithm highly parallelizable. We also provide computational assays with both synthetic data and Machine Learning datasets, where the PUCS performance was assessed against UCS and other golden standard algorithms in feature selection.
在机器学习中,特征选择是分类器设计中的重要一步。它包括找到对于给定代价函数而言最优的特征子集。解决特征选择的一种可能性是将所有可能的特征子集组织成一个布尔格,并利用该格中链的代价描述U形曲线这一事实。最小化此类代价函数被称为U曲线问题。最近,一项研究提出了U曲线搜索(UCS),这是针对该问题的一种最优算法,它已成功用于特征选择。然而,尽管该算法具有最优性,但在计算分析中UCS所需的时间与特征数量呈指数关系。在此,我们报告这种可扩展性问题是由于U曲线问题是NP难问题这一事实导致的。接着,我们介绍并行U曲线搜索(PUCS),这是一种针对U曲线问题的新算法。在PUCS中,我们提出了一种将搜索空间划分为更小布尔格的新颖方法,从而使该算法具有高度可并行性。我们还提供了针对合成数据和机器学习数据集的计算分析,在这些分析中,将PUCS的性能与UCS以及特征选择中的其他黄金标准算法进行了评估。