An Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data

Authors

Li Zihao, Zhang Liumei

Affiliation

School of Computing, Xi'an Shiyou University, Xi'an 710065, China.

Publication information

Entropy (Basel). 2023 Aug 9;25(8):1185. doi: 10.3390/e25081185.

DOI:10.3390/e25081185

PMID:37628215

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10453693/

Abstract

Outlier detection is an important task in data mining and a highly active area of machine learning research. In industrial automation, datasets are often high-dimensional, so analyzing all dimensions directly leads to data sparsity, and outliers become masked by noise in the high-dimensional space. This "curse of dimensionality" renders many conventional outlier detection methods ineffective. This paper proposes a new outlier detection algorithm called EOEH (Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data). First, random secondary subsampling is performed on the data, and detectors are run on the resulting small-scale subsamples to provide diverse detection results. The results are then aggregated to reduce the global variance and enhance the robustness of the algorithm. Next, information entropy is used to construct a dimension-space weighting method that can discern the influential factors within different dimensional spaces. This method generates weighted subspaces and dimensions for data objects, reducing the impact of noise created by high-dimensional data and improving detection performance on such data. Finally, this study designs a new high-precision local outlier factor (HPLOF) detector that amplifies the difference between normal and outlier data, further improving detection performance. The feasibility of the algorithm is validated through experiments on both simulated and UCI datasets. Compared with popular outlier detection algorithms, EOEH improves detection performance by 6% on average and, for high-dimensional data, runs 20% faster.
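The abstract does not give EOEH's formulas, so the following Python sketch only illustrates the general pattern it describes: entropy-derived dimension weights, detectors run on random subsamples, and averaged scores. Every detail here is an assumption rather than the authors' method. The weighting rule (one minus the normalized histogram entropy), the function names `entropy_weights`, `knn_score` and `ensemble_scores`, and all parameter defaults are hypothetical. A plain mean k-nearest-neighbor distance stands in for the HPLOF detector, whose definition is not stated in the abstract.

```python
import numpy as np


def entropy_weights(X, bins=10):
    """Assumed rule: weight each dimension by 1 - normalized Shannon entropy."""
    n, d = X.shape
    weights = np.empty(d)
    for j in range(d):
        counts, _ = np.histogram(X[:, j], bins=bins)
        p = counts[counts > 0] / n
        entropy = -np.sum(p * np.log(p)) / np.log(bins)  # scaled to [0, 1]
        weights[j] = 1.0 - entropy
    weights += 1e-12  # avoid an all-zero weight vector
    return weights / weights.sum()


def knn_score(X, sample_idx, k):
    """Mean distance from every point to its k nearest subsample points.

    Stand-in detector, not HPLOF.
    """
    S = X[sample_idx]
    D = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=2)  # shape (n, m)
    # A point drawn into the subsample must not count as its own neighbor.
    D[sample_idx, np.arange(len(sample_idx))] = np.inf
    nearest = np.partition(D, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


def ensemble_scores(X, n_rounds=20, sample_size=64, k=5, seed=0):
    """Average the detector scores over several random subsamples."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Scaling by sqrt(w) turns Euclidean distance into a weighted distance.
    Xw = X * np.sqrt(entropy_weights(X))
    m = min(sample_size, len(X))
    total = np.zeros(len(X))
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=m, replace=False)
        total += knn_score(Xw, idx, k)
    return total / n_rounds  # higher score = more outlying
```

Calling `ensemble_scores(X)` on an `(n, d)` array returns one score per row. Averaging over subsamples is what reduces the variance of any single detector run, and each round costs on the order of `n * sample_size` distance computations instead of `n * n`.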