EAD: effortless anomalies detection, a deep learning based approach for detecting outliers in English textual data.

Author information

Wang Xiuzhe

Affiliation

School of Foreign Languages, Zhengzhou College of Finance and Economics, Zhengzhou, Henan, China.

Publication information

PeerJ Comput Sci. 2024 Nov 13;10:e2479. doi: 10.7717/peerj-cs.2479. eCollection 2024.

Abstract

Anomalies are abnormal observations in data, and identifying them is known as anomaly detection. Failing to detect anomalies in time can affect key processes such as decision-making, fraud detection, and automated classification. Most existing anomaly detection models rely on traditional tokenization and are computationally costly, especially when outliers must be extracted from a large text. This work proposes an unsupervised system for outlier detection based on all-MiniLM-L6-v2. The method uses centroid embeddings to extract outliers from high-variety, large-volume data. To avoid mistaking novelty for an outlier, a Minimum Covariance Determinant (MCD) based approach is used to quantify the novelty of the input text. The proposed method is implemented in a Python project, App. for Anomalies Detection (AAD). The system is evaluated on two unrelated datasets: the 20 newsgroups text dataset and the SMS spam collection dataset. The accuracy (94%) and F1 score (0.95) indicate that the proposed method can effectively trace anomalies in a comparatively large text. The approach is applicable to extracting meaning from textual data, particularly in the domains of human resource management and security.
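The abstract does not spell out the pipeline, so the following is only a minimal sketch of the general idea of centroid-based outlier scoring, not the author's implementation. It assumes each text has already been mapped to an embedding vector (in the paper's setting that would come from all-MiniLM-L6-v2; here small hand-written vectors stand in for them), and it flags texts whose cosine distance to the mean embedding exceeds a cutoff. The function names, the toy values, and the 0.5 threshold are all hypothetical choices made for illustration. The MCD-based novelty step is not shown.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def flag_outliers(embeddings, threshold=0.5):
    """Return (indices farther than `threshold` from the centroid, all distances).

    `threshold` is an assumed cutoff, not a value taken from the paper.
    """
    center = centroid(embeddings)
    distances = [cosine_distance(v, center) for v in embeddings]
    flagged = [i for i, d in enumerate(distances) if d > threshold]
    return flagged, distances


# Toy 3-dimensional stand-ins for sentence embeddings (hypothetical values).
toy_embeddings = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.00, 0.10, 0.95],  # points in a very different direction from the rest
]

outlier_indices, distances = flag_outliers(toy_embeddings, threshold=0.5)
print(outlier_indices)
```

In practice the cutoff would more likely be derived from the distribution of distances (for example a high percentile) than fixed by hand, and a single extreme point pulls the plain mean toward itself, which is one reason a robust estimator such as MCD can be useful alongside a centroid.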

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bc33/11623099/ab0dae24ea34/peerj-cs-10-2479-g001.jpg
