Tuo Zhao, Mo Yu, Yiming Wang, Raman Arora, Han Liu
Johns Hopkins University; Princeton University.
Harbin Institute of Technology.
Adv Neural Inf Process Syst. 2014 Dec;27:5614.
We consider regularized empirical risk minimization problems. In particular, we minimize the sum of a smooth empirical risk function and a nonsmooth regularization function. When the regularization function is block separable, we can solve the minimization problems in a randomized block coordinate descent (RBCD) manner. Existing RBCD methods usually decrease the objective value by exploiting the partial gradient of a randomly selected block of coordinates in each iteration. Thus they need all data to be accessible so that the partial gradient over the selected block can be computed exactly. However, such a "batch" setting may be computationally expensive in practice. In this paper, we propose a mini-batch randomized block coordinate descent (MRBCD) method, which estimates the partial gradient of the selected block based on a mini-batch of randomly sampled data in each iteration. We further accelerate the MRBCD method by exploiting a semi-stochastic optimization scheme, which effectively reduces the variance of the partial gradient estimators. Theoretically, we show that for strongly convex functions, the MRBCD method attains lower overall iteration complexity than existing RBCD methods. As an application, we further tailor the MRBCD method to regularized sparse learning problems. Our numerical experiments show that the MRBCD method naturally exploits the sparsity structure and achieves better computational performance than existing methods.
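To make the algorithmic idea concrete, the following is a minimal sketch, not the authors' implementation, of mini-batch randomized block coordinate descent with SVRG-style (semi-stochastic) variance reduction, applied to an assumed instance of the problem class: l1-regularized least squares, where the l1 penalty is block separable and the proximal step is soft-thresholding. All function names, block/batch sizes, and step sizes below are illustrative choices, not values from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (block-separable l1 regularizer)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def mrbcd_sketch(X, y, lam, step, n_blocks=10, batch_size=10,
                 n_epochs=20, inner_iters=None, seed=0):
    """Hypothetical sketch of mini-batch randomized block coordinate descent
    with a semi-stochastic (SVRG-style) variance-reduced partial gradient,
    for min_w (1/2n) ||Xw - y||^2 + lam * ||w||_1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    blocks = np.array_split(np.arange(d), n_blocks)
    if inner_iters is None:
        inner_iters = n // batch_size

    for _ in range(n_epochs):
        # Outer stage: snapshot the iterate and compute its full gradient
        # (this is the "semi-stochastic" component that reduces variance).
        w_snap = w.copy()
        full_grad = X.T @ (X @ w_snap - y) / n

        for _ in range(inner_iters):
            # Randomly select a block of coordinates and a mini-batch of samples.
            j = rng.integers(n_blocks)
            idx = blocks[j]
            batch = rng.choice(n, size=batch_size, replace=False)
            Xb, yb = X[batch], y[batch]

            # Variance-reduced estimate of the partial gradient on block j:
            # mini-batch partial gradient at w, minus the same quantity at the
            # snapshot, plus the snapshot's exact partial gradient.
            g_w = Xb[:, idx].T @ (Xb @ w - yb) / batch_size
            g_snap = Xb[:, idx].T @ (Xb @ w_snap - yb) / batch_size
            g = g_w - g_snap + full_grad[idx]

            # Proximal gradient step on the selected block only.
            w[idx] = soft_threshold(w[idx] - step * g, step * lam)

    return w

if __name__ == "__main__":
    # Toy sparse regression example (synthetic data, illustrative only).
    rng = np.random.default_rng(1)
    X = rng.standard_normal((500, 100))
    w_true = np.zeros(100)
    w_true[:5] = 1.0
    y = X @ w_true + 0.01 * rng.standard_normal(500)
    w_hat = mrbcd_sketch(X, y, lam=0.05, step=0.05)
    print("recovered support:", np.flatnonzero(np.abs(w_hat) > 1e-3))
```

Because only the selected block is updated and the l1 proximal map is applied blockwise, the iterates stay sparse between full-gradient snapshots, which is the sparsity structure the abstract refers to.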