Straggler-Aware Distributed Learning: Communication-Computation Latency Trade-Off

Authors

Ozfatura Emre, Ulukus Sennur, Gündüz Deniz

Affiliations

Information Processing and Communications Lab, Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, UK.

Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA.

Publication Information

Entropy (Basel). 2020 May 13;22(5):544. doi: 10.3390/e22050544.

DOI: 10.3390/e22050544
PMID: 33286316
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7517046/
Abstract

When gradient descent (GD) is scaled to many parallel workers for large-scale machine learning applications, its per-iteration computation time is limited by the straggling workers. Straggling workers can be tolerated by assigning redundant computations and/or coding across data and computations, but in most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations. Imposing such a limitation results in two drawbacks: over-computation due to inaccurate prediction of the straggling behavior, and under-utilization due to discarding partial computations carried out by stragglers. To overcome these drawbacks, we consider multi-message communication (MMC) by allowing multiple computations to be conveyed from each worker per iteration, and propose novel straggler avoidance techniques for both coded computation and coded communication with MMC. We analyze how the proposed designs can be employed efficiently to seek a balance between the computation and communication latency. Furthermore, we identify the advantages and disadvantages of these designs in different settings through extensive simulations, both model-based and via a real implementation on Amazon EC2 servers, and demonstrate that the proposed schemes with MMC can help improve upon existing straggler avoidance schemes.
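
A minimal sketch of the trade-off described above, assuming a shifted-exponential per-task computation model, a constant per-message delay, and a cyclic r-fold replicated (uncoded) data assignment; these modeling choices and all names below are illustrative assumptions, not the paper's exact schemes. The sketch simulates one GD iteration and compares the single-message convention against MMC:

import numpy as np

rng = np.random.default_rng(0)

N = 20        # number of workers
K = 40        # number of partial gradients the PS must collect
r = 2         # replication factor: each partition assigned to r workers
mu, shift = 1.0, 0.5   # shifted-exponential per-task computation model (assumed)
c = 0.05      # per-message communication delay (assumed constant)

# Cyclic assignment: each worker holds K*r/N partitions and computes them in order.
tasks_per_worker = K * r // N
assignment = [[(w * tasks_per_worker + j) % K for j in range(tasks_per_worker)]
              for w in range(N)]

# finish[w, j]: time at which worker w completes its j-th task (tasks run sequentially).
durations = shift + rng.exponential(1.0 / mu, size=(N, tasks_per_worker))
finish = np.cumsum(durations, axis=1)

# Single-message convention: a worker transmits once, after ALL its tasks;
# the iteration ends when the finished workers jointly cover all K partitions.
covered, t_single = set(), None
for w in np.argsort(finish[:, -1]):
    covered.update(assignment[w])
    if len(covered) == K:
        t_single = finish[w, -1] + c
        break

# MMC: every completed task is sent immediately as its own message;
# the iteration ends as soon as each partition has arrived at least once,
# so partial work carried out by stragglers is not discarded.
arrivals = sorted((finish[w, j] + c, assignment[w][j])
                  for w in range(N) for j in range(tasks_per_worker))
covered, t_mmc = set(), None
for t, part in arrivals:
    covered.add(part)
    if len(covered) == K:
        t_mmc = t
        break

print(f"single-message iteration time: {t_single:.3f}")
print(f"MMC iteration time:            {t_mmc:.3f}")

In this simplified model MMC can only shorten the iteration, since stragglers' partial computations are never wasted; the communication-computation trade-off analyzed in the paper emerges once per-message overhead grows with the number of messages, eroding that saving.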

[Figures 1-14 of the article (entropy-22-00544-g001 through g014) are available at the PMC full-text link above.]

Similar Articles

1. Straggler-Aware Distributed Learning: Communication-Computation Latency Trade-Off.
Entropy (Basel). 2020 May 13;22(5):544. doi: 10.3390/e22050544.
2. LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning.
IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):962-974. doi: 10.1109/TNNLS.2020.2979762. Epub 2021 Mar 1.
3. Berrut Approximated Coded Computing: Straggler Resistance Beyond Polynomial Computing.
IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):111-122. doi: 10.1109/TPAMI.2022.3151434. Epub 2022 Dec 5.
4. Network Coding Approaches for Distributed Computation over Lossy Wireless Networks.
Entropy (Basel). 2023 Feb 27;25(3):428. doi: 10.3390/e25030428.
5. Straggler- and Adversary-Tolerant Secure Distributed Matrix Multiplication Using Polynomial Codes.
Entropy (Basel). 2023 Jan 31;25(2):266. doi: 10.3390/e25020266.
6. DPro-SM - A distributed framework for proactive straggler mitigation using LSTM.
Heliyon. 2023 Dec 10;10(1):e23567. doi: 10.1016/j.heliyon.2023.e23567. eCollection 2024 Jan 15.
7. A Communication-Efficient Distributed Matrix Multiplication Scheme with Privacy, Security, and Resiliency.
Entropy (Basel). 2024 Aug 30;26(9):743. doi: 10.3390/e26090743.
8. A Cluster-Driven Adaptive Training Approach for Federated Learning.
Sensors (Basel). 2022 Sep 18;22(18):7061. doi: 10.3390/s22187061.
9. A Parameter Communication Optimization Strategy for Distributed Machine Learning in Sensors.
Sensors (Basel). 2017 Sep 21;17(10):2172. doi: 10.3390/s17102172.
10. DisSAGD: A Distributed Parameter Update Scheme Based on Variance Reduction.
Sensors (Basel). 2021 Jul 28;21(15):5124. doi: 10.3390/s21155124.

Cited By

1. Cost-Efficient Distributed Learning via Combinatorial Multi-Armed Bandits.
Entropy (Basel). 2025 May 20;27(5):541. doi: 10.3390/e27050541.
2. Adaptive Privacy-Preserving Coded Computing with Hierarchical Task Partitioning.
Entropy (Basel). 2024 Oct 21;26(10):881. doi: 10.3390/e26100881.
3. DPro-SM - A distributed framework for proactive straggler mitigation using LSTM.
Heliyon. 2023 Dec 10;10(1):e23567. doi: 10.1016/j.heliyon.2023.e23567. eCollection 2024 Jan 15.

References

1. LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning.
IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):962-974. doi: 10.1109/TNNLS.2020.2979762. Epub 2021 Mar 1.