School of Computer and Communication Engineering, Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, Changsha University of Science and Technology, Changsha 410114, China.
Key Lab of Broadband Wireless Communication and Sensor Network Technology (Nanjing University of Posts and Telecommunications), Ministry of Education, Nanjing 210003, China.
Sensors (Basel). 2020 Apr 26;20(9):2451. doi: 10.3390/s20092451.
Log anomaly detection is an efficient method to manage modern large-scale Internet of Things (IoT) systems. A growing number of studies apply natural language processing (NLP) methods, in particular word2vec, to log feature extraction. Word2vec can extract the relevance between words and vectorize the words. However, the computing cost of training word2vec is high. Moreover, anomalies in logs depend not only on an individual log message but also on the log message sequence. Therefore, the word vectors produced by word2vec cannot be used directly: they must first be transformed into log event vectors and then into log sequence vectors. To reduce computational cost and avoid multiple transformations, in this paper we propose an offline feature extraction model, named LogEvent2vec, which takes log events as the input of word2vec to extract the relevance between log events and vectorize log events directly. LogEvent2vec can work with any coordinate transformation method and any anomaly detection model. After obtaining the log event vectors, we transform them into log sequence vectors using the barycenter (bary) or tf-idf, and train three kinds of supervised models (Random Forests, Naive Bayes, and Neural Networks) to detect anomalies. We have conducted extensive experiments on a real public log dataset from BlueGene/L (BGL). The experimental results show that, compared with word2vec, LogEvent2vec reduces computational time by a factor of 30 and improves accuracy. LogEvent2vec with bary and Random Forest achieves the best F1-score, and LogEvent2vec with tf-idf and Naive Bayes needs the least computational time.
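The step that turns log event vectors into a log sequence vector can be illustrated with a minimal sketch. It is not the authors' implementation. The event IDs, the two-dimensional vectors (stand-ins for what word2vec would learn from event sequences), the function names, and the particular tf-idf formulation (term frequency within the sequence times the log of the inverse document frequency) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical event vectors; in practice these would come from word2vec
# trained on sequences of log event IDs rather than on individual words.
event_vectors = {"E1": [1.0, 0.0], "E2": [0.0, 1.0], "E3": [1.0, 1.0]}

# Hypothetical log sequences, each a non-empty list of event IDs.
sequences = [["E1", "E2", "E1"], ["E1", "E3"]]


def barycenter(sequence, vectors):
    """Return the mean of the vectors of all events in one sequence."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for event in sequence:
        for i, x in enumerate(vectors[event]):
            total[i] += x
    return [x / len(sequence) for x in total]


def tfidf_vector(sequence, all_sequences, vectors):
    """Return the sum of event vectors weighted by tf-idf.

    Assumed formulation: tf = count / sequence length,
    idf = log(number of sequences / sequences containing the event).
    """
    dim = len(next(iter(vectors.values())))
    result = [0.0] * dim
    n_sequences = len(all_sequences)
    for event, count in Counter(sequence).items():
        tf = count / len(sequence)
        df = sum(1 for s in all_sequences if event in s)
        weight = tf * math.log(n_sequences / df)
        for i, x in enumerate(vectors[event]):
            result[i] += weight * x
    return result


bary_features = [barycenter(s, event_vectors) for s in sequences]
tfidf_features = [tfidf_vector(s, sequences, event_vectors) for s in sequences]
```

Either feature list could then be passed, together with anomaly labels, to a supervised classifier such as a random forest. Under this formulation, an event that occurs in every sequence (here `E1`) receives a tf-idf weight of zero, which is one reason the two aggregation methods can yield different features.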