Chan Ho Tung Jeremy, Veas Eduardo
Interactive System and Data Science, Graz University of Technology, 8010, Graz, Austria.
Human AI Interaction, Know Center GmbH, 8010, Graz, Austria.
Sci Rep. 2024 Oct 9;14(1):23532. doi: 10.1038/s41598-024-72640-4.
Understanding what is important and redundant within data can improve the modelling process of neural networks by reducing unnecessary model complexity, training time and memory storage. This information, however, is not always available beforehand, nor is it trivial to obtain from neural networks. Existing feature selection methods utilise the internal workings of a neural network for selection, but further analysis and interpretation of the input features' significance is often limited. We propose an approach that estimates the significance of features by analysing the gradient descent of a pairwise layer within a model. The changes in the weights and gradients throughout training provide a profile that can be used to better understand the importance hierarchy among the features for ranking and feature selection. Additionally, this method is transferable to existing fully or partially trained models, which is beneficial for understanding existing or active models. The proposed approach is demonstrated empirically in a study using benchmark datasets from libraries such as MNIST and scikit-feat, as well as a simulated dataset and an applied real-world dataset. The results are verified against the ground truth where available, and otherwise via a comparison with fundamental feature selection methods, including existing statistics-based and embedded neural-network-based feature selection methods, through the methodology of Reduce and Retrain.
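The core idea of profiling weight and gradient changes during training to rank input features can be illustrated with a minimal sketch. This is not the authors' exact method (which analyses a pairwise layer within a neural network); it is a simplified, hypothetical analogue using a single linear layer trained by plain gradient descent, where the accumulated magnitude of each input weight's gradient serves as the importance profile. All names and the synthetic data are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: accumulate the magnitude of first-layer weight
# gradients over training and use it as a feature-importance profile.
# The setup (linear model, squared error, synthetic data) is an assumption,
# not the paper's actual pairwise-layer method.

rng = np.random.default_rng(0)

# Synthetic data: only features 0 and 1 drive the target; feature 2 is noise.
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1]

w = np.zeros(3)                # weights of a single linear layer
lr = 0.01
profile = np.zeros(3)          # accumulated |gradient| per input feature

for _ in range(100):           # plain batch gradient descent on squared error
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    profile += np.abs(grad)    # record how strongly each feature's weight moves
    w -= lr * grad

ranking = np.argsort(profile)[::-1]   # features ordered most to least important
print(ranking)                        # the two informative features should lead
```

Under a Reduce and Retrain evaluation, one would drop the lowest-ranked features and retrain, checking that performance is preserved; here the noise feature (index 2) ends up ranked last.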