Cui Delong, Peng Zhiping, Li Kaibin, Li Qirui, He Jieguang, Deng Xiangwu
College of Electronic Information Engineering, Guangdong University of Petrochemical Technology, Maoming, China.
Jiangmen Polytechnic, Jiangmen, China.
PLoS One. 2025 Aug 21;20(8):e0329669. doi: 10.1371/journal.pone.0329669. eCollection 2025.
With the increasing popularity of cloud computing services, their large and dynamic workloads have rendered task scheduling an NP-complete problem. To address large-scale task scheduling in cloud environments, this paper proposes a novel cloud task scheduling framework based on hierarchical deep reinforcement learning (DRL). The framework defines a set of virtual machines (VMs) as a VM cluster and employs hierarchical scheduling: tasks are first allocated to a cluster and then to individual VMs within it. The DRL-based scheduler adapts to dynamic changes in the cloud environment by continuously learning and updating its network parameters. Experiments demonstrate that the framework effectively balances cost and performance, optimizing objectives such as load balance, cost, and overdue time. Under low load, costs are reduced by using low-cost nodes within Service Level Agreement (SLA) limits; under high load, resource utilization is improved through load balancing. Compared with classical heuristic algorithms, the framework achieves an overall improvement of roughly 10% in load balancing, cost, and overdue time.
The method nevertheless has limitations. A hierarchical DRL scheduler adds complexity and computational overhead: implementing and maintaining it requires significant computational resources and machine-learning expertise. Continuous learning and updating of network parameters may introduce latency that affects real-time scheduling efficiency. Furthermore, the framework's performance depends heavily on the quality and quantity of training data, which can be difficult to obtain and maintain in a dynamic cloud environment.
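To illustrate the two-level (cluster-then-VM) scheduling idea described in the abstract, the following is a minimal sketch. The paper's actual schedulers are learned DRL policies; here both levels are replaced by simple greedy stand-ins (lowest aggregate utilization for the cluster, least-loaded VM within it), and all class and function names (`VM`, `Cluster`, `schedule_task`) are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class VM:
    vm_id: int
    capacity: float
    load: float = 0.0


@dataclass
class Cluster:
    cluster_id: int
    vms: list


def schedule_task(clusters, task_size):
    """Hierarchically assign one task: pick a cluster, then a VM inside it."""
    # Level 1: choose the cluster with the lowest aggregate utilization
    # (a greedy stand-in for the paper's cluster-level DRL policy).
    def cluster_util(c):
        return sum(vm.load for vm in c.vms) / sum(vm.capacity for vm in c.vms)

    cluster = min(clusters, key=cluster_util)

    # Level 2: choose the least-utilized VM within the chosen cluster
    # (a stand-in for the VM-level policy).
    vm = min(cluster.vms, key=lambda v: v.load / v.capacity)
    vm.load += task_size
    return cluster.cluster_id, vm.vm_id
```

In the paper's framework, each greedy choice above would instead be sampled from a neural policy whose parameters are updated online, which is what lets the scheduler shift between the cost-saving and load-balancing behaviors reported in the experiments.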