Ho Qirong, Cipar James, Cui Henggang, Kim Jin Kyu, Lee Seunghak, Gibbons Phillip B, Gibson Garth A, Ganger Gregory R, Xing Eric P
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.
Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213.
Adv Neural Inf Process Syst. 2013;2013:1223-1231.
We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model's values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully synchronous and asynchronous schemes.
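The abstract describes the SSP interface only at a high level. The following Python sketch illustrates the kind of bounded-staleness read/update/clock protocol it refers to, using a thread-based simulation on one machine; the names (SSPTable, read, inc, clock) and the cache-stamping scheme are illustrative assumptions here, not the system's actual API.

import threading
from collections import defaultdict

class SSPTable:
    """Toy shared parameter table with a bounded-staleness (SSP) read rule.
    Illustrative sketch only; not the paper's implementation."""
    def __init__(self, num_workers, staleness):
        self.staleness = staleness              # maximum clock lag s tolerated on reads
        self.store = defaultdict(float)         # stand-in for central parameter storage
        self.worker_clock = [0] * num_workers   # per-worker iteration counters
        self.cond = threading.Condition()

    def _min_clock(self):
        return min(self.worker_clock)

    def read(self, worker_id, key, cache):
        # A worker at clock c may serve reads from its local cache as long as the
        # cached copy is at most s clocks stale; otherwise it waits until the
        # slowest worker has reached c - s, then refreshes from the store.
        c = self.worker_clock[worker_id]
        val, stamp = cache.get(key, (None, -1))
        if val is not None and stamp >= c - self.staleness:
            return val                          # fast path: stale-but-acceptable cached value
        with self.cond:
            self.cond.wait_for(lambda: self._min_clock() >= c - self.staleness)
            fresh = self.store[key]
            cache[key] = (fresh, self._min_clock())
        return fresh

    def inc(self, key, delta):
        # Additive write to the shared value (e.g. a gradient step).
        with self.cond:
            self.store[key] += delta

    def clock(self, worker_id):
        # Mark the end of this worker's iteration and wake any waiting readers.
        with self.cond:
            self.worker_clock[worker_id] += 1
            self.cond.notify_all()

A worker loop under this sketch would look like: w = table.read(i, "w", cache); table.inc("w", -lr * grad(w)); table.clock(i). Setting staleness to 0 recovers a fully synchronous scheme, while a large bound approaches the asynchronous setting; the abstract's empirical comparison concerns exactly this trade-off.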