Zheng Jian, Qu Hongchun, Li Zhaoni, Li Lin, Tang Xiaoming, Guo Fei
College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China.
College of Automation, Chongqing University of Posts and Telecommunications, Chongqing, China.
PeerJ Comput Sci. 2022 Aug 11;8:e1061. doi: 10.7717/peerj-cs.1061. eCollection 2022.
Feature extraction often relies on the input data carrying sufficient information. In a high-dimensional space, however, the data are distributed too sparsely to provide it. High dimensionality also complicates the search for features scattered across subspaces. Extracting features from high-dimensional data is therefore a difficult task. To address this issue, this article proposes a novel autoencoder method that uses the Mahalanobis distance metric of a rescaling transformation. The key idea is that applying this metric reduces the difference between the reconstructed distribution and the original distribution, which improves the autoencoder's ability to extract features. Results show that the proposed approach outperforms state-of-the-art methods in both the accuracy of feature extraction and the linear separability of the extracted features. We suggest that distance metric-based methods are better suited than feature selection-based methods to extracting linearly separable features from high-dimensional data. In a high-dimensional space, evaluating feature similarity is relatively easier than evaluating feature importance, so distance metric methods, which evaluate feature similarity, have an advantage in feature extraction over feature selection methods, which assess feature importance. Evaluating feature importance, however, is more computationally efficient than evaluating feature similarity.
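To make the notion of a Mahalanobis distance under a rescaling transformation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the rescaling is a whitening transform derived from the data covariance, so that Euclidean distance after rescaling equals Mahalanobis distance before it. The function names `mahalanobis_rescale` and `mahalanobis_reconstruction_loss`, the regularizer `eps`, and the noisy stand-in for an autoencoder reconstruction are all illustrative assumptions.

```python
import numpy as np


def mahalanobis_rescale(X, eps=1e-6):
    """Whiten X (hypothetical helper, not from the paper).

    Returns the rescaled data, the feature means, and a matrix L with
    L @ L.T equal to the inverse of the (regularized) covariance of X.
    """
    mu = X.mean(axis=0)
    # eps * I keeps the covariance invertible; its value is an assumption.
    cov = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
    # Cholesky factor of the inverse covariance: inv(cov) = L @ L.T
    L = np.linalg.cholesky(np.linalg.inv(cov))
    return (X - mu) @ L, mu, L


def mahalanobis_reconstruction_loss(X, X_hat, L):
    """Mean squared Mahalanobis distance between rows of X and X_hat."""
    diff = (X - X_hat) @ L  # rescale the residuals, then measure Euclidean length
    return float(np.mean(np.sum(diff ** 2, axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data with very different feature scales.
    X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 0.1, 5.0, 2.0])
    # Stand-in for an autoencoder output: the input plus small noise.
    X_hat = X + rng.normal(scale=0.1, size=X.shape)

    Z, mu, L = mahalanobis_rescale(X)
    print("loss:", mahalanobis_reconstruction_loss(X, X_hat, L))
```

Under these assumptions, a residual on a low-variance feature is penalized more heavily than the same residual on a high-variance feature, which is one way a distance metric can account for the shape of the original distribution when comparing it with a reconstruction.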