Tanash Mohammed, Dunn Brandon, Andresen Daniel, Hsu William, Yang Huichen, Okanlawon Adedolapo
Kansas State University, Manhattan, Kansas.
PEARC19 (2019). 2019 Jul;2019. doi: 10.1145/3332186.3333041. Epub 2019 Jul 28.
High-Performance Computing (HPC) systems are resources used for data capture, sharing, and analysis. The majority of our HPC users come from disciplines other than Computer Science. HPC users, including computer scientists, have difficulty deciding how many resources their submitted jobs require on the cluster and do not feel proficient enough to do so. Consequently, users are encouraged to over-estimate the resources for their submitted jobs so that the jobs will not be killed due to insufficient resources. This practice wastes HPC resources and leads to inefficient cluster utilization. We created a supervised machine learning model and integrated it into the Slurm resource manager simulator to predict the amount of memory a job requires and the time required to run the computation. Our model uses several different machine learning algorithms. Our goal is to integrate and test the proposed supervised machine learning model on Slurm. We used over 10,000 tasks selected from our HPC log files to evaluate the performance and accuracy of our integrated model. The purpose of our work is to increase the performance of Slurm by predicting the memory and the time required for each particular job, and thereby to improve the utilization of the HPC system. Our results indicate that, for large jobs, our model helps dramatically reduce computational turnaround time (from five days to ten hours), substantially increases utilization of the HPC system, and decreases the average waiting time for submitted jobs.
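To make the prediction idea concrete, the following is a minimal sketch of predicting a job's actual memory and run time from its submission request, using a k-nearest-neighbour average over past jobs. It is not the authors' model: the abstract does not name the algorithms or features used, so the `JobRecord` fields, the `predict` function, the choice of k-nearest neighbours, and the toy history below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class JobRecord:
    # Hypothetical features taken from a job's submission request.
    req_mem_gb: float
    req_hours: float
    cpus: int
    # Hypothetical targets observed after the job finished.
    used_mem_gb: float
    used_hours: float


def _distance(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(history, req_mem_gb, req_hours, cpus, k=3):
    """Average the actual usage of the k past jobs most similar to the request.

    Returns (predicted_mem_gb, predicted_hours).
    """
    if not history:
        raise ValueError("history must contain at least one finished job")
    query = (req_mem_gb, req_hours, cpus)
    ranked = sorted(
        history,
        key=lambda j: _distance((j.req_mem_gb, j.req_hours, j.cpus), query),
    )
    nearest = ranked[:k]
    mem = sum(j.used_mem_gb for j in nearest) / len(nearest)
    hours = sum(j.used_hours for j in nearest) / len(nearest)
    return mem, hours


# Made-up history for illustration only; not data from the paper.
HISTORY = [
    JobRecord(64, 48, 16, used_mem_gb=6.0, used_hours=5.0),
    JobRecord(64, 48, 16, used_mem_gb=8.0, used_hours=7.0),
    JobRecord(64, 24, 16, used_mem_gb=7.0, used_hours=6.0),
    JobRecord(8, 2, 1, used_mem_gb=1.0, used_hours=0.5),
    JobRecord(8, 2, 1, used_mem_gb=2.0, used_hours=0.5),
]

if __name__ == "__main__":
    mem, hours = predict(HISTORY, req_mem_gb=64, req_hours=48, cpus=16)
    print(f"predicted memory: {mem:.1f} GB, predicted run time: {hours:.1f} h")
```

In this toy history, a request for 64 GB and 48 hours resembles past jobs that actually used far less, which is the kind of over-estimation the abstract describes. A real deployment would need features drawn from the cluster's own logs, feature scaling, and a safety margin on the predictions so that jobs are not killed for under-allocation.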