Ngo Nicolas, Michel Pierre, Giorgi Roch
Aix Marseille Univ, Inserm, IRD, SESSTIM, Sciences Économiques & Sociales de la Santé & Traitement de l'Information Médicale, ISSPAM, Marseille, France.
Aix Marseille Univ, CNRS, AMSE, Aix-Marseille School of Economics, Marseille, France.
BMC Med Res Methodol. 2024 Dec 19;24(1):307. doi: 10.1186/s12874-024-02426-9.
The γ-metric value is generally used as the importance score of a feature (or a set of features) in a classification context. This study aimed to go further by creating a new methodology for multivariate feature selection for classification, whereby the γ-metric is associated with a specific search direction (and therefore a specific stopping criterion). As three search directions are used, we effectively created three distinct methods.
We assessed the performance of our new methodology through a simulation study, comparing the three methods against more conventional ones. Classification performance indicators, number of selected features, stability and execution time were used to evaluate the methods. We also evaluated how well the proposed methodology selected relevant features for the detection of atrial fibrillation (AF), a cardiac arrhythmia.
We found that in both the simulation study and the AF detection task, our methods were able to select informative features and maintain a good level of predictive performance; however, with strongly correlated features and large datasets, the γ-metric based methods were less efficient at excluding non-informative features.
Results highlighted that the forward search direction combines well with the γ-metric as an evaluation function. However, with the backward search direction, the feature selection algorithm could fall into a local optimum, which leaves room for improvement.
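
To make the idea of pairing an evaluation function with a search direction and a stopping criterion concrete, here is a minimal sketch of a greedy forward search. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the γ-metric, so `separation_score` is a hypothetical stand-in (squared distance between two class centroids divided by the summed within-class variances), and the names `forward_select` and `min_gain` as well as the toy data are invented for this example.

```python
from statistics import mean, pvariance


def separation_score(rows, labels, subset):
    """Stand-in evaluation function (NOT the γ-metric).

    Squared distance between the two class centroids, restricted to the
    features in `subset`, divided by the summed within-class variances.
    """
    classes = sorted(set(labels))
    a = [r for r, l in zip(rows, labels) if l == classes[0]]
    b = [r for r, l in zip(rows, labels) if l == classes[1]]
    between, within = 0.0, 0.0
    for j in subset:
        col_a = [r[j] for r in a]
        col_b = [r[j] for r in b]
        between += (mean(col_a) - mean(col_b)) ** 2
        within += pvariance(col_a) + pvariance(col_b)
    return between / within if within > 0 else 0.0


def forward_select(rows, labels, score_fn, min_gain=1e-9):
    """Greedy forward search: add the feature that raises the score most,
    and stop as soon as no candidate improves it by more than `min_gain`."""
    remaining = set(range(len(rows[0])))
    selected, best = [], 0.0
    while remaining:
        cand_score, cand = max(
            (score_fn(rows, labels, selected + [j]), j) for j in sorted(remaining)
        )
        if cand_score - best <= min_gain:  # stopping criterion
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best


# Toy data (hypothetical): feature 0 separates the classes, 1 and 2 are noise.
rows = [
    [0.0, 1.0, 5.0], [1.0, 3.0, 5.5], [0.5, 2.0, 4.5], [0.5, 2.0, 5.0],
    [5.0, 1.0, 5.0], [6.0, 3.0, 4.5], [5.5, 2.0, 5.5], [5.5, 2.0, 5.0],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

print(forward_select(rows, labels, separation_score))  # -> ([0], 100.0)
```

Because the search is greedy, it only ever compares subsets that differ by one feature from the current one, so it can stop at a local optimum. A backward search would start from the full feature set and remove one feature at a time under the same kind of stopping rule.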