Wang Xiuzhe
School of Foreign Languages, Zhengzhou College of Finance and Economics, Zhengzhou, Henan, China.
PeerJ Comput Sci. 2024 Nov 13;10:e2479. doi: 10.7717/peerj-cs.2479. eCollection 2024.
Anomalies are abnormal patterns in data, and identifying them is known as anomaly detection. Failure to detect anomalies in time may affect key processes such as decision-making, fraud detection, and automated classification. Most existing anomaly detection models rely on traditional tokenization and are computationally expensive, particularly when outliers must be extracted from a large text. This work proposes an unsupervised, all-MiniLM-L6-v2-based system for outlier detection. The method uses centroid embeddings to extract outliers from high-variety, large-volume data. To avoid mistaking novelty for an outlier, a Minimum Covariance Determinant (MCD) based approach is used to measure the novelty of the input text. The proposed method is implemented in a Python project, App. for Anomalies Detection (AAD). The system is evaluated on two unrelated datasets: the 20 newsgroups text dataset and the SMS spam collection dataset. The resulting accuracy (94%) and F1 score (0.95) indicate that the proposed method can effectively trace anomalies in a comparatively large text. The approach is applicable to extracting meaning from textual data, particularly in the domains of human resource management and security.
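The centroid-embedding idea in the abstract can be illustrated with a minimal sketch. It is not the AAD implementation: the toy 3-dimensional vectors stand in for the 384-dimensional sentence embeddings that all-MiniLM-L6-v2 would produce, the MCD-based novelty step is omitted, and the function names and the median-plus-MAD threshold (`k=3.0`) are assumptions chosen only for illustration.

```python
import math
from statistics import median


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def flag_outliers(vectors, k=3.0):
    """Return indices whose distance to the centroid exceeds
    median + k * MAD (a hypothetical thresholding rule)."""
    c = centroid(vectors)
    dists = [cosine_distance(v, c) for v in vectors]
    med = median(dists)
    mad = median([abs(d - med) for d in dists])
    threshold = med + k * mad
    return [i for i, d in enumerate(dists) if d > threshold]


# Toy "embeddings": four similar texts and one that points elsewhere.
vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [1.0, 0.0, 0.1],
    [0.95, 0.1, 0.05],
    [0.0, 0.1, 1.0],  # intended outlier
]

flagged = flag_outliers(vectors)
print(flagged)
```

In a fuller pipeline the toy vectors would be replaced by model embeddings of each text, and a robust covariance estimate such as MCD could replace the plain centroid distance so that a single extreme point distorts the reference less.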