Designing a Streaming Algorithm for Outlier Detection in Data Mining-An Incremental Approach.

Affiliations

School of Computer Science, Carleton University, Ottawa, ON K1S 5B6, Canada.

School of Information Technology, Carleton University, Ottawa, ON K1S 5B6, Canada.

Publication information

Sensors (Basel). 2020 Feb 26;20(5):1261. doi: 10.3390/s20051261.

Abstract

Detecting outliers in streaming data has become an important task in many common applications, such as fraud detection, network analysis, and environmental monitoring. Because real-time data may arrive as streams rather than batches, properties such as concept drift, temporal context, transiency, and uncertainty must be considered. In addition, data processing must be incremental, scalable, and feasible with limited memory. These requirements pose major challenges to the accuracy of existing outlier detection algorithms when they are implemented incrementally, especially in a streaming environment. To address these problems, we first propose C_KDE_WR, which uses a sliding window and [...] to process streaming data online, and we report results demonstrating high throughput on real-time streaming data with an implementation in the CUDA framework on a Graphics Processing Unit (GPU). We also present another algorithm, C_LOF, based on Local Outlier Factor (LOF), a very popular and effective outlier detection algorithm that unfortunately works only on batched data. Using a novel incremental approach that compensates for the high complexity of LOF, we show how to implement it in a streaming context and obtain results in a timely manner. Like C_KDE_WR, C_LOF employs a sliding window and [...] to make decisions based on the data in the current window, and it addresses the same streaming-data challenges as C_KDE_WR. In addition, we report a comparative evaluation of the accuracy of C_KDE_WR against the state-of-the-art SOD_GPU using Precision, Recall, and F-score metrics, and we perform a t-test to demonstrate the significance of the improvement. We further report test results for C_LOF under different parameter settings and plot ROC and PR curves, with the area under the curve (AUC) and Average Precision (AP) values calculated respectively. Experimental results show that C_LOF can overcome the [...] problem, which often exists in outlier detection on streaming data. We provide complexity analysis and report experimental results on the accuracy of both C_KDE_WR and C_LOF in order to evaluate their effectiveness and efficiency.
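To make the sliding-window idea and the accuracy metrics above more concrete, here is a minimal sketch in Python. It is not C_KDE_WR or C_LOF: the Gaussian kernel, the one-dimensional data, and the `window_size`, `bandwidth`, `threshold`, and `warmup` parameters are all assumptions chosen for illustration, and the function names are hypothetical. It only shows how a bounded window allows each arriving point to be scored in a single pass with limited memory, and how Precision, Recall, and F-score are computed from the resulting flags.

```python
import math
from collections import deque


def window_density(x, window, bandwidth):
    """Average Gaussian kernel value between x and the points in the window.

    The Gaussian kernel is an assumption made for this sketch.
    """
    if not window:
        return 0.0
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((x - w) / bandwidth) ** 2) for w in window)
    return norm * total / len(window)


def detect_outliers(stream, window_size=50, bandwidth=1.0,
                    threshold=0.01, warmup=10):
    """Yield (index, value, is_outlier) for each point in one pass.

    All parameter names and default values are hypothetical.
    """
    # A bounded deque drops the oldest point automatically, so memory
    # use stays fixed and old data stops influencing the score.
    window = deque(maxlen=window_size)
    for i, x in enumerate(stream):
        if len(window) >= warmup:
            is_outlier = window_density(x, window, bandwidth) < threshold
        else:
            is_outlier = False  # too little history to judge
        yield i, x, is_outlier
        # Every point enters the window, flagged or not, so the window
        # can follow gradual changes in the data.
        window.append(x)


def precision_recall_f1(predicted, actual):
    """Compute Precision, Recall, and F-score from boolean flags."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

A caller would pass any iterable of numbers to `detect_outliers`, collect the third element of each yielded tuple, and compare those flags against ground-truth labels with `precision_recall_f1`. Scoring each point against the whole window costs time proportional to the window size, which is the kind of per-point work a GPU implementation could parallelize.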


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/196f/7085525/28dc39754a51/sensors-20-01261-g001.jpg
