Straggler-Aware Distributed Learning: Communication-Computation Latency Trade-Off

Authors

Ozfatura Emre, Ulukus Sennur, Gündüz Deniz

Affiliations

Information Processing and Communications Lab, Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, UK.

Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA.

Publication Information

Entropy (Basel). 2020 May 13;22(5):544. doi: 10.3390/e22050544.

DOI: 10.3390/e22050544
PMID: 33286316
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7517046/
Abstract

When gradient descent (GD) is scaled to many parallel workers for large-scale machine learning applications, its per-iteration computation time is limited by the straggling workers. Straggling workers can be tolerated by assigning redundant computations and/or coding across data and computations, but in most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations. Imposing such a limitation results in two drawbacks: over-computation due to inaccurate prediction of the straggling behavior, and under-utilization due to discarding partial computations carried out by stragglers. To overcome these drawbacks, we consider multi-message communication (MMC) by allowing multiple computations to be conveyed from each worker per iteration, and propose novel straggler avoidance techniques for both coded computation and coded communication with MMC. We analyze how the proposed designs can be employed efficiently to seek a balance between the computation and communication latency. Furthermore, we identify the advantages and disadvantages of these designs in different settings through extensive simulations, both model-based and via a real implementation on Amazon EC2 servers, and demonstrate that the proposed schemes with MMC can help improve upon existing straggler avoidance schemes.
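
A minimal sketch of the trade-off described above, assuming a shifted-exponential per-task computation model, a constant per-message delay, and a cyclic r-fold replicated (uncoded) data assignment; these modeling choices and all names below are illustrative assumptions, not the paper's exact schemes. The sketch simulates one GD iteration and compares the single-message convention against MMC:

import numpy as np

rng = np.random.default_rng(0)

N = 20        # number of workers
K = 40        # number of partial gradients the PS must collect
r = 2         # replication factor: each partition assigned to r workers
mu, shift = 1.0, 0.5   # shifted-exponential per-task computation model (assumed)
c = 0.05      # per-message communication delay (assumed constant)

# Cyclic assignment: each worker holds K*r/N partitions and computes them in order.
tasks_per_worker = K * r // N
assignment = [[(w * tasks_per_worker + j) % K for j in range(tasks_per_worker)]
              for w in range(N)]

# finish[w, j]: time at which worker w completes its j-th task (tasks run sequentially).
durations = shift + rng.exponential(1.0 / mu, size=(N, tasks_per_worker))
finish = np.cumsum(durations, axis=1)

# Single-message convention: a worker transmits once, after ALL its tasks;
# the iteration ends when the finished workers jointly cover all K partitions.
covered, t_single = set(), None
for w in np.argsort(finish[:, -1]):
    covered.update(assignment[w])
    if len(covered) == K:
        t_single = finish[w, -1] + c
        break

# MMC: every completed task is sent immediately as its own message;
# the iteration ends as soon as each partition has arrived at least once,
# so partial work carried out by stragglers is not discarded.
arrivals = sorted((finish[w, j] + c, assignment[w][j])
                  for w in range(N) for j in range(tasks_per_worker))
covered, t_mmc = set(), None
for t, part in arrivals:
    covered.add(part)
    if len(covered) == K:
        t_mmc = t
        break

print(f"single-message iteration time: {t_single:.3f}")
print(f"MMC iteration time:            {t_mmc:.3f}")

In this simplified model MMC can only shorten the iteration, since stragglers' partial computations are never wasted; the communication-computation trade-off analyzed in the paper emerges once per-message overhead grows with the number of messages, eroding that saving.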

[Figures 1-14 of the article (entropy-22-00544-g001 through g014) are available at the PMC full-text link above.]

Similar Articles

1. Straggler-Aware Distributed Learning: Communication-Computation Latency Trade-Off.
Entropy (Basel). 2020 May 13;22(5):544. doi: 10.3390/e22050544.
2. LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning.
IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):962-974. doi: 10.1109/TNNLS.2020.2979762. Epub 2021 Mar 1.
3. Berrut Approximated Coded Computing: Straggler Resistance Beyond Polynomial Computing.
IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):111-122. doi: 10.1109/TPAMI.2022.3151434. Epub 2022 Dec 5.
4. Network Coding Approaches for Distributed Computation over Lossy Wireless Networks.
Entropy (Basel). 2023 Feb 27;25(3):428. doi: 10.3390/e25030428.
5. Straggler- and Adversary-Tolerant Secure Distributed Matrix Multiplication Using Polynomial Codes.
Entropy (Basel). 2023 Jan 31;25(2):266. doi: 10.3390/e25020266.
6. DPro-SM - A distributed framework for proactive straggler mitigation using LSTM.
Heliyon. 2023 Dec 10;10(1):e23567. doi: 10.1016/j.heliyon.2023.e23567. eCollection 2024 Jan 15.
7. A Communication-Efficient Distributed Matrix Multiplication Scheme with Privacy, Security, and Resiliency.
Entropy (Basel). 2024 Aug 30;26(9):743. doi: 10.3390/e26090743.
8. A Cluster-Driven Adaptive Training Approach for Federated Learning.
Sensors (Basel). 2022 Sep 18;22(18):7061. doi: 10.3390/s22187061.
9. A Parameter Communication Optimization Strategy for Distributed Machine Learning in Sensors.
Sensors (Basel). 2017 Sep 21;17(10):2172. doi: 10.3390/s17102172.
10. DisSAGD: A Distributed Parameter Update Scheme Based on Variance Reduction.
Sensors (Basel). 2021 Jul 28;21(15):5124. doi: 10.3390/s21155124.

Cited By

1. Cost-Efficient Distributed Learning via Combinatorial Multi-Armed Bandits.
Entropy (Basel). 2025 May 20;27(5):541. doi: 10.3390/e27050541.
2. Adaptive Privacy-Preserving Coded Computing with Hierarchical Task Partitioning.
Entropy (Basel). 2024 Oct 21;26(10):881. doi: 10.3390/e26100881.
3. DPro-SM - A distributed framework for proactive straggler mitigation using LSTM.
Heliyon. 2023 Dec 10;10(1):e23567. doi: 10.1016/j.heliyon.2023.e23567. eCollection 2024 Jan 15.

References

1. LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning.
IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):962-974. doi: 10.1109/TNNLS.2020.2979762. Epub 2021 Mar 1.