

AMPRO-HPCC: A Machine-Learning Tool for Predicting Resources on Slurm HPC Clusters.

Authors

Mohammed Tanash, Daniel Andresen, William Hsu

Affiliations

Computer Science Department, Kansas State University, Manhattan, United States.

Publication

ADVCOMP Int Conf Adv Eng Comput Appl Sci. 2021 Oct;2021:20-27.

Abstract

Determining resource allocations (memory and time) for submitted jobs on High Performance Computing (HPC) systems is a challenging process, even for computer scientists. HPC users are strongly encouraged to overestimate the resources requested for their jobs so that the jobs are not killed for exceeding their allocations. This overestimation stems from the wide variety of HPC applications and environment configuration options, and from limited knowledge of the complex structure of HPC systems; it wastes HPC resources, lowers system utilization, and increases waiting and turnaround times for submitted jobs. In this paper, we introduce the first fully-offline, fully-automated, stand-alone, open-source Machine Learning (ML) tool that helps users predict the memory and time requirements of jobs submitted to a cluster. The tool implements six discriminative ML models from scikit-learn and Microsoft LightGBM, applied to historical accounting data (sacct data) from the Simple Linux Utility for Resource Management (Slurm). We tested the tool on sacct data from Kansas State University's HPC cluster (Beocat), covering January 2019 through March 2021 and containing around 17.6 million jobs. Our results show that the tool achieves high predictive accuracy (0.72 with LightGBM for predicting memory and 0.74 with Random Forest for predicting time), dramatically reduces average waiting and turnaround times for submitted jobs, and increases utilization of HPC resources, thereby also decreasing their power consumption.
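The abstract describes training predictors on Slurm's historical accounting records. As a minimal sketch of how such records might be preprocessed, the snippet below parses pipe-delimited `sacct` output (`JobID`, `ReqMem`, `MaxRSS`, `Elapsed` are real sacct fields, but the unit-handling rules and row schema here are illustrative assumptions, not the authors' actual preprocessing code):

```python
# Sketch: turn Slurm accounting (sacct) records into numeric training rows
# of the kind a memory/time predictor could consume. Assumes output from
# something like: sacct -P -o JobID,ReqMem,MaxRSS,Elapsed

def parse_mem(field: str) -> float:
    """Convert a Slurm memory string such as '4000Mn' or '2Gn' to megabytes."""
    field = field.rstrip("nc")  # drop per-node ('n') / per-CPU ('c') suffix
    units = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 * 1024.0}
    if field and field[-1] in units:
        return float(field[:-1]) * units[field[-1]]
    return float(field or 0)

def parse_elapsed(field: str) -> int:
    """Convert a '[DD-]HH:MM:SS' elapsed string to seconds."""
    days = 0
    if "-" in field:
        d, field = field.split("-", 1)
        days = int(d)
    h, m, s = (int(x) for x in field.split(":"))
    return (days * 24 + h) * 3600 + m * 60 + s

def rows_from_sacct(text: str):
    """Parse pipe-delimited sacct output into feature dictionaries."""
    lines = text.strip().splitlines()
    header = lines[0].split("|")
    for line in lines[1:]:
        rec = dict(zip(header, line.split("|")))
        yield {
            "job_id": rec["JobID"],
            "req_mem_mb": parse_mem(rec["ReqMem"]),    # what the user asked for
            "used_mem_mb": parse_mem(rec["MaxRSS"]),   # what the job actually used
            "elapsed_s": parse_elapsed(rec["Elapsed"]),
        }
```

Rows like these, with the used-memory and elapsed-time columns discretized into classes, could then be fed to discriminative models such as scikit-learn's Random Forest or LightGBM, as the paper does.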


