An improved ant colony optimization algorithm with fault tolerance for job scheduling in grid computing systems.

Authors

Idris Hajara, Ezugwu Absalom E, Junaidu Sahalu B, Adewumi Aderemi O

Affiliations

Department of Mathematics, Ahmadu Bello University Zaria, Nigeria.

Department of Computer Science, Federal University Lafia, Nasarawa State, Nigeria.

Publication information

PLoS One. 2017 May 17;12(5):e0177567. doi: 10.1371/journal.pone.0177567. eCollection 2017.

DOI:10.1371/journal.pone.0177567

PMID:28545075

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5435234/

Abstract

The Grid scheduler schedules user jobs on the best available resource in terms of resource characteristics by optimizing job execution time. Resource failure in the Grid is no longer an exception but a regularly occurring event, as resources are increasingly being used by the scientific community to solve computationally intensive problems that typically run for days or even months. It is therefore essential that these long-running applications are able to tolerate failures and avoid re-computation from scratch after a resource failure has occurred, in order to satisfy the user's Quality of Service (QoS) requirement. Job scheduling with fault tolerance in Grid computing using Ant Colony Optimization (ACO) is proposed to ensure that jobs are executed successfully even when resource failure has occurred. The technique employed in this paper is the use of the resource failure rate together with a checkpoint-based rollback recovery strategy. Checkpointing aims at reducing the amount of work that is lost upon failure of the system by immediately saving the state of the system. A comparison of the proposed approach with an existing ACO algorithm is discussed. The experimental results of the implemented fault-tolerant scheduling algorithm show an improvement in meeting the user's QoS requirement over the existing ACO algorithm, which has no fault tolerance integrated in it. The performance of the two algorithms was measured in terms of three main scheduling performance metrics: makespan, throughput and average turnaround time.
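The effect of checkpoint-based rollback recovery on lost work can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the paper's algorithm: the function name `time_to_complete`, the unit-step time model, the fixed checkpoint interval, and the zero cost for saving and restoring state are all hypothetical choices made for illustration.

```python
def time_to_complete(total_work, failure_steps, checkpoint_interval=None):
    """Return the number of time steps needed to finish a job.

    Assumptions (illustrative only):
    - each time step completes one unit of work unless a failure occurs;
    - a failure at a given step wastes that step and rolls progress back
      to the last saved checkpoint (or to 0 if there is none);
    - saving and restoring a checkpoint cost no time.
    """
    failures = set(failure_steps)
    progress = 0   # work units completed so far
    saved = 0      # progress recorded in the last checkpoint
    clock = 0      # elapsed time steps

    while progress < total_work:
        clock += 1
        if clock in failures:
            progress = saved  # roll back to the last checkpoint
            continue
        progress += 1
        if checkpoint_interval and progress % checkpoint_interval == 0:
            saved = progress  # save the current state

    return clock


# A 10-unit job with one resource failure at time step 8.
print(time_to_complete(10, [8]))                         # 18: restart from scratch
print(time_to_complete(10, [8], checkpoint_interval=5))  # 13: resume from unit 5
```

In this toy model, the job without checkpointing loses all 7 completed units at the failure, while the checkpointed job loses only the 2 units done since its last checkpoint.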
