Institute of Clinical Pharmacology, Goethe-University, Theodor-Stern-Kai 7, 60590, Frankfurt am Main, Germany.
Fraunhofer Institute for Translational Medicine and Pharmacology ITMP, Theodor-Stern-Kai 7, 60596, Frankfurt am Main, Germany.
Sci Rep. 2023 Apr 4;13(1):5470. doi: 10.1038/s41598-023-32396-9.
Selecting the k best features is a common task in machine learning. Typically, a few features have high importance, but many have low importance (right-skewed distribution). This report proposes a numerically precise method to address this skewed feature importance distribution in order to reduce a feature set to the informative minimum of items. Computed ABC analysis (cABC) is an item categorization method that aims to identify the most important items by partitioning a set of non-negative numerical items into subsets "A", "B", and "C" such that subset "A" contains the "few important" items based on specific properties of ABC curves defined by their relationship to Lorenz curves. In its recursive form, the cABC analysis can be applied again to subset "A". A generic image dataset and three biomedical datasets (lipidomics and two genomics datasets) with a large number of variables were used to perform the experiments. The experimental results show that the recursive cABC analysis limits the dimensions of the data projection to a minimum where the relevant information is still preserved and directs the feature selection in machine learning to the most important class-relevant information, including filtering feature sets for nonsense variables. Feature sets were reduced to 10% or less of the original variables and still provided accurate classification in data not used for feature selection. cABC analysis, in its recursive variant, provides a computationally precise means of reducing information to a minimum. The minimum is the result of a computation of the number of k most relevant items, rather than a decision to select the k best items from a list. In addition, there are precise criteria for stopping the reduction process. The reduction to the most important features can improve the human understanding of the properties of the data set. The cABC method is implemented in the Python package "cABCanalysis" available at https://pypi.org/project/cABCanalysis/ .
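The idea of computing k from the shape of the ABC curve, rather than choosing it by hand, can be illustrated with a short NumPy sketch. It is a simplified illustration and not the implementation in the "cABCanalysis" package: the function names are hypothetical, the set "A" boundary is assumed here to be the point on the ABC curve closest to the ideal point (0, 1), subsets "B" and "C" are not computed, and the stopping rule (`max_depth`, `min_items`) is a placeholder for the precise criteria mentioned in the abstract.

```python
import numpy as np


def abc_curve(values):
    """Return (effort, yield) coordinates of the ABC curve.

    Items are sorted from largest to smallest; effort is the fraction of
    items used so far, yield is the fraction of the total sum they cover.
    """
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    if v.size == 0 or np.any(v < 0) or v.sum() == 0:
        raise ValueError("need a non-empty set of non-negative values with a positive sum")
    effort = np.arange(1, v.size + 1) / v.size
    cumulative_yield = np.cumsum(v) / v.sum()
    return effort, cumulative_yield


def set_a_indices(values):
    """Indices of the items assigned to set "A" (simplified, assumed rule).

    Assumption: set "A" ends at the curve point nearest to (0, 1),
    i.e. maximal yield for minimal effort.
    """
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)[::-1]              # item indices, largest value first
    effort, cumulative_yield = abc_curve(values)
    distance = np.hypot(effort, 1.0 - cumulative_yield)
    k = int(np.argmin(distance)) + 1              # k is computed, not chosen
    return order[:k]


def recursive_set_a(values, max_depth=3, min_items=2):
    """Re-apply the set "A" selection to the previous set "A".

    max_depth and min_items are hypothetical stopping parameters.
    """
    values = np.asarray(values, dtype=float)
    selected = np.arange(values.size)
    for _ in range(max_depth):
        if selected.size <= min_items:
            break
        inner = set_a_indices(values[selected])
        if inner.size == selected.size:           # no further reduction possible
            break
        selected = selected[inner]
    return selected


# Hypothetical right-skewed feature importances
importances = [1, 50, 3, 30, 10, 1, 4, 1]
print(sorted(set_a_indices(importances).tolist()))    # [1, 3]
print(sorted(recursive_set_a(importances).tolist()))  # [1, 3]
```

In this toy example the two largest items cover 80% of the total importance with 25% of the items, which is the curve point nearest to (0, 1), so k = 2 follows from the data rather than from a user decision.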