Wang Chien-Chih, Tan Kent Loong, Chen Chun-Ting, Lin Yu-Hsiang, Keerthi S Sathiya, Mahajan Dhruv, Sundararajan S, Lin Chih-Jen
Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan
Department of Physics, National Taiwan University, Taipei 10617, Taiwan
Neural Comput. 2018 Jun;30(6):1673-1724. doi: 10.1162/neco_a_01088. Epub 2018 Apr 13.
Deep learning involves a difficult nonconvex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of the function, gradient, and Hessian is expensive. In particular, the communication and synchronization costs may become a bottleneck. In this letter, we focus on situations where the model is distributedly stored and propose a novel distributed Newton method for training deep neural networks. By variable and feature-wise data partitions and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as the memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices to reduce the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks. Details of some implementation issues in distributed environments are thoroughly investigated. Experiments demonstrate that the proposed method is effective for the distributed training of deep neural networks. Compared with stochastic gradient methods, it is more robust and may give better test accuracy.
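As a rough illustration of the core operation the letter builds on, the sketch below computes a subsampled Gauss-Newton matrix-vector product (J^T B J)v with JAX's forward- and reverse-mode primitives. It is a single-machine toy, not the authors' distributed implementation: the network, loss, data shapes, and function names (toy_net, gauss_newton_vp) are illustrative assumptions, and the paper's variable/feature-wise partitioning, diagonalization, and early termination are not shown. Products of this form are what a conjugate-gradient solver would call repeatedly to obtain an approximate Newton direction.

import jax
import jax.numpy as jnp

def toy_net(w, x):
    # Tiny two-layer network; w is one flat weight vector (illustrative assumption).
    W1 = w[:8].reshape(2, 4)
    W2 = w[8:].reshape(4, 3)
    return jnp.tanh(x @ W1) @ W2          # network outputs z(w, x)

def loss(z, y):
    # Squared loss on the network outputs.
    return 0.5 * jnp.sum((z - y) ** 2)

def gauss_newton_vp(w, x, y, v):
    # Return (J^T B J) v on the subsample (x, y), where J = dz/dw and
    # B is the Hessian of the loss with respect to the outputs z.
    z, Jv = jax.jvp(lambda w_: toy_net(w_, x), (w,), (v,))            # forward mode: J v
    BJv = jax.jvp(jax.grad(lambda z_: loss(z_, y)), (z,), (Jv,))[1]   # output-space Hessian: B (J v)
    _, vjp_fn = jax.vjp(lambda w_: toy_net(w_, x), w)                 # reverse mode: J^T (B J v)
    return vjp_fn(BJv)[0]

keys = jax.random.split(jax.random.PRNGKey(0), 4)
w = jax.random.normal(keys[0], (20,))      # 8 + 12 weights of the toy network
x = jax.random.normal(keys[1], (16, 2))    # a subsampled mini-batch of inputs
y = jax.random.normal(keys[2], (16, 3))    # and targets
v = jax.random.normal(keys[3], (20,))      # direction supplied by, e.g., a CG iteration
print(gauss_newton_vp(w, x, y, v).shape)   # (20,): same shape as the weight vector

Because the product is formed from Jacobian-vector and vector-Jacobian calls on a data subsample, the full Gauss-Newton matrix is never materialized, which is what makes subsampling and per-machine approximations of the kind described in the abstract attractive.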