Business Information Systems Department, Arab Academy for Science Technology and Maritime Transport, Cairo 11799, Egypt.
Information & Computing Lab, AtlanTTIC Research Center, Universidade de Vigo, 36310 Vigo, Spain.
Sensors (Basel). 2020 Jul 23;20(15):4111. doi: 10.3390/s20154111.
Performance analysis is an essential task in high-performance computing (HPC) systems, where it serves purposes such as anomaly detection, optimal resource allocation, and budget planning. HPC monitoring generates a large number of key performance indicators (KPIs) that track the status of the jobs running in these systems. KPIs report CPU usage, memory usage, network (interface) traffic, and readings from other sensors that monitor the hardware. Analyzing these data yields insight into running jobs, such as their characteristics, performance, and failures. The main contribution of this paper is to identify which metrics (KPIs) are the most appropriate for identifying and classifying different types of jobs according to their behavior in the HPC system. To this end, we applied different clustering techniques (partition and hierarchical clustering algorithms) to a real dataset from the Galician computation center (CESGA). We concluded that (i) the metrics (KPIs) related to network (interface) traffic monitoring provided the best cohesion and separation when clustering HPC jobs, and (ii) hierarchical clustering algorithms were the most suitable for this task. Our approach was validated using a different real dataset from the same HPC center.
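As an illustration of the kind of evaluation the abstract describes, the following minimal sketch clusters a handful of jobs with a hierarchical algorithm and scores the result with the mean silhouette coefficient, a common measure that combines cohesion and separation. It is not the authors' implementation: the abstract does not state which linkage or validity index was used, so average linkage and the silhouette coefficient are assumptions here, and the job feature vectors are invented toy values rather than CESGA data.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, k):
    """Average-linkage agglomerative clustering; returns one label per point."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the closest pair
        del clusters[j]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    scores = []
    for idx, lab in enumerate(labels):
        own = [j for j in groups[lab] if j != idx]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: cohesion (mean distance to the point's own cluster).
        a = sum(euclidean(points[idx], points[j]) for j in own) / len(own)
        # b: separation (mean distance to the nearest other cluster).
        b = min(sum(euclidean(points[idx], points[j]) for j in members) / len(members)
                for other, members in groups.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical per-job KPI summaries (toy values, two features per job).
jobs = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = agglomerative(jobs, 2)
score = silhouette(jobs, labels)
print(labels, round(score, 3))
```

Running the same scoring on feature vectors built from different KPI families (for example CPU, memory, or network traffic) would give one way to compare which family separates jobs most cleanly, which is the type of comparison the abstract reports.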