EAD: effortless anomalies detection, a deep learning based approach for detecting outliers in English textual data.

Author information

Wang Xiuzhe

Affiliation

School of Foreign Languages, Zhengzhou College of Finance and Economics, Zhengzhou, Henan, China.

Publication information

PeerJ Comput Sci. 2024 Nov 13;10:e2479. doi: 10.7717/peerj-cs.2479. eCollection 2024.

DOI:10.7717/peerj-cs.2479

PMID:39650354

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11623099/

Abstract

Anomalies are abnormal observations in data, and identifying them is known as anomaly detection. Failure to detect anomalies in time can impair key processes such as decision-making, fraud detection, and automated classification. Most existing anomaly detection models rely on traditional tokenization and are computationally expensive, particularly when outliers must be extracted from a large text. This work proposes an unsupervised system for outlier detection based on the all-MiniLM-L6-v2 sentence-embedding model. The method uses centroid embeddings to extract outliers from high-variety, large-volume data. To avoid mistaking novelty for an outlier, a Minimum Covariance Determinant (MCD) based approach is used to quantify the novelty of the input text. The proposed method is implemented in a Python project, App. for Anomalies Detection (AAD). The system is evaluated on two unrelated datasets: the 20 newsgroups text dataset and the SMS spam collection dataset. The accuracy (94%) and F1 score (0.95) obtained indicate that the proposed method can effectively detect anomalies in a comparatively large text. The approach is applicable to extracting meaning from textual data, particularly in the domains of human resource management and security.
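The centroid idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the AAD implementation, whose details the abstract does not give. It assumes each sentence has already been turned into an embedding vector (in practice this could come from the sentence-transformers library, e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)`), and it uses tiny hand-made 3-dimensional vectors as stand-ins. The function names, the toy data, and the median plus k times MAD threshold are all assumptions made for illustration. The threshold is a simple robust rule swapped in for the paper's MCD step, which could instead be approximated with `sklearn.covariance.MinCovDet`.

```python
import math
from statistics import median


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def flag_outliers(vectors, k=3.0):
    """Flag vectors whose distance to the centroid exceeds median + k * MAD.

    Assumption: this robust threshold is an illustrative stand-in,
    not the MCD-based procedure described in the paper.
    """
    c = centroid(vectors)
    dists = [cosine_distance(v, c) for v in vectors]
    med = median(dists)
    mad = median(abs(d - med) for d in dists)  # median absolute deviation
    threshold = med + k * mad
    return [d > threshold for d in dists], dists


# Toy stand-ins for sentence embeddings (hypothetical data):
# four vectors pointing in a similar direction and one pointing elsewhere.
embeddings = [
    [1.00, 0.10, 0.00],
    [0.90, 0.20, 0.10],
    [1.00, 0.00, 0.10],
    [0.95, 0.15, 0.05],
    [0.00, 0.10, 1.00],
]

flags, distances = flag_outliers(embeddings)
for vec, dist, is_outlier in zip(embeddings, distances, flags):
    print(vec, round(dist, 4), "outlier" if is_outlier else "inlier")
```

On this toy input only the last vector is flagged, because its cosine distance to the centroid is far larger than that of the other four. With real embeddings the value of `k` would need tuning against labelled examples.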

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bc33/11623099/ab0dae24ea34/peerj-cs-10-2479-g001.jpg
