Fault tolerance in distributed systems using deep learning approaches.

Author information

Assiri Basem, Sheneamer Abdullah

Affiliation

Computer Science Department, Faculty of Engineering and Computer Science, Jazan University, Jazan, Saudi Arabia.

Publication information

PLoS One. 2025 Jan 7;20(1):e0310657. doi: 10.1371/journal.pone.0310657. eCollection 2025.

DOI:10.1371/journal.pone.0310657

PMID:39774399

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11706390/

Abstract

Distributed systems have recently become the backbone of technological development. They serve as the foundation for emerging technologies such as blockchain and the Internet of Things. A distributed system provides fault tolerance and decentralization, so that a fault in any single component does not cause the whole system to fail. Deep learning models, in turn, process data to find patterns, which supports classification, regression, prediction, and clustering. This work employs deep learning to handle faults within distributed systems in three scenarios. First, a faulty processor may be unable to produce the right output; the deep learning model therefore uses the inputs and outputs of other processors to find patterns and produce the proper output for the faulty processor. Second, if a faulty processor also corrupts its inputs, the model learns from the inputs and outputs of successful processors and produces the proper output for the faulty processor even from corrupted inputs. Third, the data may be unrelated, meaning that the input patterns of the faulty processors differ from those of the successful ones; in this case, the model is able to detect the new pattern and label it as unknown. In the experiments, we use deep learning models such as VGG16, VGG19, AlexNet, LSTM, and ResNet34 to investigate the performance of deep learning in these three scenarios. For unstructured datasets, the accuracy of the models is affected by the proportion of faulty data: across all models, accuracy ranges from 60% when 90% of the data is faulty to 96% when 10% of the data is faulty. Structured datasets are not significantly affected by the proportion of faulty data, and accuracy reaches 99%.
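
The three scenarios in the abstract can be illustrated with a minimal sketch. It is not the authors' method: a nearest-centroid classifier stands in for the deep networks named above, and every name in it (`fit_centroids`, `recover_output`, `UNKNOWN_THRESHOLD`) as well as the toy data is hypothetical. The (input, output) pairs of healthy processors act as training data; the model then reproduces the output a faulty processor should have produced, tolerates a mildly corrupted input, and labels an input far from every learned pattern as unknown.

```python
import math

# Hypothetical cutoff: inputs farther than this from every learned pattern
# are treated as an unseen pattern (scenario 3).
UNKNOWN_THRESHOLD = 2.0


def fit_centroids(pairs):
    """Average the inputs observed for each output label on healthy processors."""
    sums, counts = {}, {}
    for x, label in pairs:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def recover_output(centroids, x, threshold=UNKNOWN_THRESHOLD):
    """Predict the output a faulty processor should have produced for input x."""
    label, dist = min(
        ((lbl, math.dist(x, c)) for lbl, c in centroids.items()),
        key=lambda item: item[1],
    )
    return label if dist <= threshold else "unknown"


# Toy (input, output) pairs collected from healthy processors.
healthy_pairs = [
    ([0.0, 0.1], "A"), ([0.2, -0.1], "A"),
    ([5.0, 5.1], "B"), ([4.8, 4.9], "B"),
]
centroids = fit_centroids(healthy_pairs)

print(recover_output(centroids, [0.1, 0.0]))      # scenario 1: clean input
print(recover_output(centroids, [0.9, -0.6]))     # scenario 2: corrupted input
print(recover_output(centroids, [20.0, -20.0]))   # scenario 3: unrelated input
```

A distance threshold is only one simple way to flag unrelated data; the abstract does not say how the paper's models decide that a pattern is unknown.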