Entropy-based grid approach for handling outliers: a case study to environmental monitoring data.

Author information

National University of Computer and Emerging Sciences, Karachi, Pakistan.

College of Computer Science and Technology, Northeastern University Shenyang, Shenyang, China.

Publication information

Environ Sci Pollut Res Int. 2023 Dec;30(60):125138-125157. doi: 10.1007/s11356-023-26780-1. Epub 2023 Jun 12.

Abstract

Grid-based approaches provide an efficient framework for clustering incomplete, inexplicit, and uncertain data. This paper proposes an entropy-based grid approach (EGO) for outlier detection in clustered data. Given hard clusters obtained from a hard clustering algorithm, EGO uses entropy, computed on the dataset as a whole or on an individual cluster, to detect outliers. EGO works in two steps: explicit outlier detection and implicit outlier detection. Explicit outlier detection targets data points that are isolated in their grid cells. Such a point is either far from the dense region or a nearby but isolated point, and is therefore declared an explicit outlier. Implicit outlier detection targets outliers that deviate from the normal pattern in less obvious ways. Such outliers are identified from the change in entropy of the dataset or of a specific cluster for each deviation. An elbow based on the trade-off between entropy and object geometries optimizes the outlier detection process. Experimental results on CHAMELEON datasets and other similar datasets suggest that the proposed approach detects outliers more precisely and extends outlier detection capability by an additional 4.5% to 8.6%. Moreover, the resulting clusters become more precise and compact when the entropy-based gridding approach is applied on top of hard clustering algorithms. The performance of the proposed algorithms is compared with that of well-known outlier detection algorithms, including DBSCAN, HDBSCAN, RE3WC, LOF, LoOP, ABOD, CBLOF, and HBOS. Finally, a case study on detecting outliers in environmental data was carried out using the proposed approach, with results generated on our synthetically prepared datasets. The results indicate that the proposed approach may be an industry-oriented solution for outlier detection in environmental monitoring data.
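
The two-step idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' EGO implementation: the function names (`grid_cells`, `occupancy_entropy`, `explicit_outliers`, `entropy_change`), the uniform grid, the `n_bins` and `max_count` parameters, and the use of Shannon entropy over cell occupancy are all assumptions made for illustration. The elbow-based selection mentioned in the abstract is not covered.

```python
import numpy as np

def grid_cells(points, n_bins):
    """Map each point to an integer grid-cell index along every axis (uniform grid, assumed)."""
    points = np.asarray(points, dtype=float)
    lo = points.min(axis=0)
    span = points.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on constant axes
    cells = np.floor((points - lo) / span * n_bins).astype(int)
    return np.clip(cells, 0, n_bins - 1)  # points on the upper edge go to the last cell

def occupancy_entropy(cells):
    """Shannon entropy (bits) of how the points are spread over occupied cells."""
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def explicit_outliers(cells, max_count=1):
    """Flag points whose grid cell holds at most `max_count` points (hypothetical rule)."""
    _, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True
    )
    return counts[np.ravel(inverse)] <= max_count

def entropy_change(cells):
    """Entropy drop obtained by removing each point in turn (leave-one-out)."""
    base = occupancy_entropy(cells)
    keep = np.ones(len(cells), dtype=bool)
    deltas = np.empty(len(cells))
    for i in range(len(cells)):
        keep[i] = False
        deltas[i] = base - occupancy_entropy(cells[keep])
        keep[i] = True
    return deltas

# Illustrative data: one tight cluster plus a single distant point.
rng = np.random.default_rng(0)
data = np.vstack([rng.uniform(0, 1, size=(20, 2)), [[10.0, 10.0]]])
cells = grid_cells(data, n_bins=4)
isolated = explicit_outliers(cells)  # boolean mask of isolated points
deltas = entropy_change(cells)       # larger value = removal lowers entropy more
```

In this sketch, a point that sits alone in its cell is flagged directly, while the leave-one-out entropy change ranks points by how much their removal makes the occupancy distribution more concentrated. How the paper combines these signals with object geometries is not specified in the abstract.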
