VMware, Inc., Palo Alto, CA 94304, USA.
Institute of Mathematics of NAS RA, Yerevan 0019, Armenia.
Sensors (Basel). 2021 Feb 25;21(5):1590. doi: 10.3390/s21051590.
The main purpose of application performance monitoring/management (APM) software is to ensure the highest availability, efficiency and security of applications. APM software accomplishes these goals through automation, measurement, analysis and diagnostics. Gartner specifies three crucial capabilities of APM software. The first is end-user experience monitoring, which reveals how users interact with application and infrastructure components. The second is application discovery, diagnostics and tracing. The third is data analytics powered by machine learning (ML) and artificial intelligence (AI) for predictions, anomaly detection, event correlation and root cause analysis. Time series metrics, logs and traces are the three pillars of observability and a valuable source of information for IT operations. Accurate, scalable and robust time series forecasting and anomaly detection are the capabilities required of the analytics. Approaches based on neural networks (NN) and deep learning are gaining popularity due to their flexibility and ability to tackle complex nonlinear problems. However, some disadvantages of NN-based models for distributed cloud applications temper expectations and require specific approaches. We demonstrate how NN models, pretrained on a global time series database, can be applied to customer-specific data using transfer learning. In general, NN models operate adequately only on stationary time series. Applying them to nonstationary time series requires multilayer data processing: hypothesis testing for data categorization, category-specific transformations into stationary data, forecasting, and backward transformations. We present the mathematical background of this approach and discuss experimental results based on an implementation for Wavefront by VMware (an APM software) used to monitor real customer cloud environments.
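The multilayer processing described in the abstract (test, transform to stationary data, forecast, transform back) can be illustrated with a minimal sketch. It is not the implementation from the paper: the only category handled is a linear trend, the hypothesis test is a plain t-test on the least-squares slope, and `stationary_forecast` is a hypothetical placeholder that stands in for a pretrained NN model. All function names and the threshold `t_crit` are assumptions made for illustration.

```python
import numpy as np

def has_linear_trend(y, t_crit=3.0):
    """Assumed categorization test: is the OLS slope significantly non-zero?"""
    n = len(y)
    x = np.arange(n, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    # Standard error of the slope estimate.
    s2 = resid @ resid / (n - 2)
    se = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))
    trended = (slope != 0) if se == 0 else abs(slope / se) > t_crit
    return trended, slope, intercept

def stationary_forecast(z, horizon):
    """Hypothetical stand-in for a pretrained NN: repeats the series mean."""
    return np.full(horizon, z.mean())

def forecast(y, horizon):
    """Test -> forward transform -> forecast -> backward transform."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    trended, slope, intercept = has_linear_trend(y)
    if trended:
        # Forward transform: remove the fitted trend to get stationary data.
        z = y - (slope * np.arange(n) + intercept)
    else:
        z = y
    z_hat = stationary_forecast(z, horizon)
    if trended:
        # Backward transform: add the extrapolated trend to the forecast.
        x_future = np.arange(n, n + horizon)
        return z_hat + slope * x_future + intercept
    return z_hat

if __name__ == "__main__":
    x = np.arange(50)
    series = 2.0 * x + 1.0 + 0.1 * np.sin(x)  # synthetic trended metric
    print(forecast(series, 3))
```

In this sketch the forecaster only ever sees the detrended series, which is the point of the design: the model is kept on stationary input, and the nonstationary component is handled by the transformation layers around it.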