Saadoon Muntadher, Hamid Siti Hafizah Ab, Sofian Hazrina, Altarturi Hamza, Nasuha Nur, Azizul Zati Hakim, Sani Asmiza Abdul, Asemi Adeleh
Department of Software Engineering, Faculty of Computer Science and Information Technology, University Malaya, Kuala Lumpur 50603, Malaysia.
Sensors (Basel). 2021 May 31;21(11):3799. doi: 10.3390/s21113799.
Hadoop MapReduce detects and recovers from faults reactively, after they occur, using static heartbeat detection and re-execution from scratch. However, these techniques incur excessive response-time penalties and inefficient resource consumption during detection and recovery. Existing fault-tolerance solutions attempt to mitigate these limitations without considering critical conditions such as fail-slow faults, the impact of faults at different infrastructure levels, and the relationship between the detection and recovery stages. This paper analyses the response time under two main fault conditions, fail-stop and fail-slow, when they manifest at the node, service, and task levels at runtime. In addition, we focus on the relationship between the time taken to detect faults and the time taken to recover from them. The experimental analysis is conducted on a real Hadoop cluster comprising the MapReduce, YARN, and HDFS frameworks. Our analysis shows that recovering from a single fault incurs an average response-time penalty of 67.6%. Even when the detection and recovery times are well tuned, data locality and resource availability must also be considered to obtain the optimum tolerance time and the lowest penalties.
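The reactive detect-then-recover cost described in the abstract can be sketched as a toy timing model, not Hadoop code: a fault is only noticed once a static heartbeat timeout expires, and re-execution from scratch discards the failed task's completed work, so both terms add to the response-time penalty. All function names and numbers below are hypothetical illustrations, not values from the paper.

```python
def detection_time(last_heartbeat, fail_time, timeout):
    """Latency before the master declares the node dead: with static
    heartbeat detection, the fault surfaces only when the fixed timeout
    expires after the last heartbeat, regardless of when the node failed."""
    return (last_heartbeat + timeout) - fail_time

def recovery_penalty(progress, task_runtime, detect):
    """Re-execution from scratch discards all completed work, so the
    penalty is the detection latency plus the redone portion of the task."""
    return detect + progress * task_runtime

# Hypothetical example: node fails at t=105 s, 5 s after its last
# heartbeat at t=100 s, with a 30 s static timeout.
detect = detection_time(last_heartbeat=100.0, fail_time=105.0, timeout=30.0)
# The failed task was 80% through a 200 s run; all of that is redone.
penalty = recovery_penalty(progress=0.8, task_runtime=200.0, detect=detect)
print(detect, penalty)  # 25.0 185.0
```

The model makes the abstract's point concrete: shrinking the timeout cuts the detection term but risks false positives on fail-slow nodes, while the re-execution term is governed by how much work must be redone, which is why the paper treats detection and recovery as coupled stages rather than tuning either in isolation.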