Li Jueyou, Gu Chuanye, Wu Zhiyou, Huang Tingwen
IEEE Trans Cybern. 2022 Feb;52(2):1009-1020. doi: 10.1109/TCYB.2020.2990796. Epub 2022 Feb 16.
This article focuses on multiagent distributed-constrained optimization problems in a dynamic environment, in which a group of agents aims to cooperatively optimize a sum of time-varying local cost functions subject to time-varying coupled constraints. Both the local cost functions and the constraint functions are not revealed to an individual agent until an action is submitted. We first investigate a gradient-feedback scenario, where each agent can access both the values and the gradients of its own cost and constraint functions at the chosen action. We then design a distributed primal-dual online learning algorithm and show that it achieves sublinear bounds on both the regret and the constraint violations. Furthermore, we extend the gradient-feedback algorithm to a gradient-free setup, where an individual agent observes only the values of its local cost and constraint functions at two queried points near the selected action. We develop a bandit version of the previous method and derive explicit sublinear bounds on the expected regret and expected constraint violations. The results indicate that the bandit algorithm achieves almost the same performance as the gradient-feedback algorithm under mild conditions. Finally, numerical simulations on an electric vehicle charging problem demonstrate the effectiveness of the proposed algorithms.
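The two mechanisms the abstract describes can be illustrated with a minimal single-agent sketch: a projected primal-dual update for an online constrained problem, and a two-point (bandit) gradient estimate built from function values at two queried points near the current action. Everything below is a generic illustration under simplifying assumptions (one agent, a fixed quadratic cost, a single linear constraint); the function names, step sizes, and the toy problem are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_point_grad(f, x, delta=1e-3):
    """Two-point bandit gradient estimate (illustrative helper):
    query f at x + delta*u and x - delta*u for a random unit
    direction u, then scale the difference. Unbiased for the
    gradient of a smoothed version of f."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return (d / (2.0 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u

def primal_dual_step(x, lam, grad_f, g_val, grad_g, eta=0.02):
    """One primal-dual update for min f(x) s.t. g(x) <= 0:
    gradient descent on the Lagrangian in x, projected
    gradient ascent on the multiplier lam."""
    x_new = x - eta * (grad_f + lam * grad_g)
    lam_new = max(0.0, lam + eta * g_val)  # project dual onto R_+
    return x_new, lam_new

# Toy problem: f(x) = ||x - c||^2 subject to sum(x) <= 1.
# The unconstrained minimizer c = (0.6, 0.6) violates the constraint,
# so the optimum is x* = (0.5, 0.5) on the constraint boundary.
c = np.array([0.6, 0.6])
x, lam = np.zeros(2), 0.0
avg = np.zeros(2)  # running average of the iterates
for t in range(1, 1001):
    grad = two_point_grad(lambda z: np.sum((z - c) ** 2), x)
    x, lam = primal_dual_step(x, lam, grad, np.sum(x) - 1.0, np.ones(2))
    avg += (x - avg) / t
```

With only two function evaluations per round standing in for the gradient, the averaged iterate still settles near the constrained optimum, which is the qualitative point behind the bandit algorithm matching the gradient-feedback one.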