Tanash Mohammed, Andresen Daniel, Hsu William
Computer Science Department, Kansas State University, Manhattan, United States.
ADVCOMP Int Conf Adv Eng Comput Appl Sci. 2021 Oct;2021:20-27.
Determining resource allocations (memory and time) for submitted jobs in High Performance Computing (HPC) systems is a challenging process, even for computer scientists. HPC users are strongly encouraged to overestimate the resource allocations for their submitted jobs so that the jobs will not be killed due to insufficient resources. Users overestimate because of the wide variety of HPC applications and environment configuration options, and because of limited knowledge of the complex structure of HPC systems. This causes wasted HPC resources, decreased utilization of HPC systems, and increased waiting and turnaround times for submitted jobs. In this paper, we introduce the first fully-offline, fully-automated, stand-alone, open-source Machine Learning (ML) tool to help users predict the memory and time requirements of jobs submitted to the cluster. Our tool implements six ML discriminative models from scikit-learn and Microsoft LightGBM, applied to historical accounting data (sacct data) from the Simple Linux Utility for Resource Management (Slurm). We have tested our tool on historical sacct data from the HPC resources of Kansas State University (Beocat), covering January 2019 through March 2021 and containing around 17.6 million jobs. Our results show that our tool achieves high predictive accuracy (0.72 using LightGBM for predicting memory and 0.74 using Random Forest for predicting time), helps substantially reduce the average waiting and turnaround times of submitted jobs, and increases utilization of HPC resources. Consequently, our tool also decreases the power consumption of those resources.
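The abstract describes training discriminative classifiers on Slurm sacct accounting records to predict a job's memory and time requirements. The following minimal sketch illustrates that idea with a Random Forest from scikit-learn; the feature names, memory buckets, and synthetic data are assumptions for illustration only, not the paper's actual pipeline or dataset.

```python
# Hypothetical sketch: predict a job's peak-memory class from sacct-style
# features (ReqCPUS, ReqNodes, Timelimit). All names and data are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for sacct fields a site might extract per job.
req_cpus = rng.integers(1, 65, n)        # requested CPU cores
req_nodes = rng.integers(1, 9, n)        # requested nodes
timelimit_min = rng.integers(10, 4320, n)  # requested wall time (minutes)
X = np.column_stack([req_cpus, req_nodes, timelimit_min])

# Toy target: bucket (noisy) peak memory into low / medium / high classes,
# standing in for the discretized MaxRSS a real pipeline would derive.
mem = req_cpus * 0.5 + req_nodes * 2 + rng.normal(0, 2, n)
y = np.digitize(mem, bins=[10, 25])  # 0 = low, 1 = medium, 2 = high

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {acc:.2f}")
```

A real deployment would replace the synthetic arrays with features parsed from `sacct` output and would compare several models (the paper uses six, including LightGBM) before recommending an allocation.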