He Zilong, Chen Pengfei, Li Xiaoyun, Wang Yongfeng, Yu Guangba, Chen Cailin, Li Xinrui, Zheng Zibin
IEEE Trans Neural Netw Learn Syst. 2023 Apr;34(4):1705-1719. doi: 10.1109/TNNLS.2020.3027736. Epub 2023 Apr 4.
Anomaly detection is critical for maintaining the performance of a cloud system, and data-driven methods have become the mainstream approach in recent years. However, because labeled training data are scarce in practice, an anomaly detection model must be trainable on contaminated data in an unsupervised way. In addition, as cloud systems grow more complex, effectively organizing data collected from a wide range of system components and modeling the spatiotemporal dependence among them become challenging. In this article, we propose TopoMAD, a stochastic seq2seq model that robustly models spatial and temporal dependence among contaminated data. We incorporate system topological information to organize metrics from different components and apply sliding windows over continuously collected metrics to capture temporal dependence. We extract spatial features with graph neural networks and temporal features with long short-term memory networks. Moreover, we build our model on a variational auto-encoder, which enables it to remain robust even when trained on contaminated data. Our approach is validated on run-time performance data collected from two representative cloud systems: a big data batch processing system and a microservice-based transaction processing system. The experimental results show that TopoMAD outperforms some state-of-the-art methods on these two data sets.
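The two data-organization steps named in the abstract, sliding windows over continuously collected metrics and the use of system topology to relate components, can be illustrated with a minimal sketch. This is not the TopoMAD implementation: the mean over neighbors is only a simple stand-in for a graph neural network layer, the LSTM and variational auto-encoder parts are omitted, and the function names, the three-component topology, and the metric values are all hypothetical.

```python
from typing import List, Tuple

Snapshot = List[List[float]]  # one time step: [node][feature]


def sliding_windows(series: List[Snapshot], window: int, stride: int = 1) -> List[List[Snapshot]]:
    """Return overlapping windows of `window` consecutive time steps."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [series[i:i + window] for i in range(0, len(series) - window + 1, stride)]


def neighbor_lists(num_nodes: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Build undirected neighbor lists, with a self-loop on every node."""
    neighbors = [{n} for n in range(num_nodes)]
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    return [sorted(ns) for ns in neighbors]


def aggregate(snapshot: Snapshot, neighbors: List[List[int]]) -> Snapshot:
    """Mean-aggregate each node's features over itself and its neighbors.

    Assumption: a plain mean replaces the learned graph layer of a real model.
    """
    num_features = len(snapshot[0])
    return [
        [sum(snapshot[j][f] for j in ns) / len(ns) for f in range(num_features)]
        for ns in neighbors
    ]


# Hypothetical system of three components in a chain: 0 -- 1 -- 2.
edges = [(0, 1), (1, 2)]
# Hypothetical metrics: 5 time steps, 3 components, 1 metric per component.
metrics = [[[float(t + n)] for n in range(3)] for t in range(5)]

neighbors = neighbor_lists(3, edges)
windows = sliding_windows(metrics, window=3)
# Spatially aggregated windows, shaped [window][time][node][feature].
smoothed = [[aggregate(snapshot, neighbors) for snapshot in w] for w in windows]
```

In a full model, each spatially aggregated window would feed a sequence encoder; here the output only shows how topology and windowing shape the input.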