Saadoon Muntadher, Hamid Siti Hafizah Ab, Sofian Hazrina, Altarturi Hamza, Nasuha Nur, Azizul Zati Hakim, Sani Asmiza Abdul, Asemi Adeleh
Department of Software Engineering, Faculty of Computer Science and Information Technology, University Malaya, Kuala Lumpur 50603, Malaysia.
Sensors (Basel). 2021 May 31;21(11):3799. doi: 10.3390/s21113799.
Hadoop MapReduce detects and recovers from faults reactively, after they occur, using static heartbeat detection and re-execution from scratch. However, these techniques incur excessive response-time penalties and inefficient resource consumption during detection and recovery. Existing fault-tolerance solutions attempt to mitigate these limitations without considering critical conditions such as fail-slow faults, the impact of faults at different infrastructure levels, and the relationship between the detection and recovery stages. This paper analyses the response time under two main fault conditions, fail-stop and fail-slow, when they manifest at the node, service, and task levels at runtime. In addition, we focus on the relationship between the time taken to detect faults and the time taken to recover from them. The experimental analysis is conducted on a real Hadoop cluster comprising the MapReduce, YARN, and HDFS frameworks. Our analysis shows that recovering from a single fault incurs an average response-time penalty of 67.6%. Even when the detection and recovery times are well tuned, data locality and resource availability must also be considered to obtain the optimum tolerance time and the lowest penalties.
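The reactive detect-then-recover cost described in the abstract can be sketched as a toy timing model, not Hadoop code: a fault is only noticed once a static heartbeat timeout expires, and re-execution from scratch discards the failed task's completed work, so both terms add to the response-time penalty. All function names and numbers below are hypothetical illustrations, not values from the paper.

```python
def detection_time(last_heartbeat, fail_time, timeout):
    """Latency before the master declares the node dead: with static
    heartbeat detection, the fault surfaces only when the fixed timeout
    expires after the last heartbeat, regardless of when the node failed."""
    return (last_heartbeat + timeout) - fail_time

def recovery_penalty(progress, task_runtime, detect):
    """Re-execution from scratch discards all completed work, so the
    penalty is the detection latency plus the redone portion of the task."""
    return detect + progress * task_runtime

# Hypothetical example: node fails at t=105 s, 5 s after its last
# heartbeat at t=100 s, with a 30 s static timeout.
detect = detection_time(last_heartbeat=100.0, fail_time=105.0, timeout=30.0)
# The failed task was 80% through a 200 s run; all of that is redone.
penalty = recovery_penalty(progress=0.8, task_runtime=200.0, detect=detect)
print(detect, penalty)  # 25.0 185.0
```

The model makes the abstract's point concrete: shrinking the timeout cuts the detection term but risks false positives on fail-slow nodes, while the re-execution term is governed by how much work must be redone, which is why the paper treats detection and recovery as coupled stages rather than tuning either in isolation.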