Department of Physics & INFN, University of Rome 'Tor Vergata', Via della Ricerca Scientifica 1, 00133, Rome, Italy.
Laboratory of Physics of the École Normale Supérieure, 24 Rue Lhomond, 75005, Paris, France.
Eur Phys J E Soft Matter. 2023 Mar 3;46(3):9. doi: 10.1140/epje/s10189-023-00271-0.
We consider the problem of two active particles in 2D complex flows with the multi-objective goals of minimizing both the dispersion rate and the control activation cost of the pair. We approach the problem by means of multi-objective reinforcement learning (MORL), combining scalarization techniques with a Q-learning algorithm, for Lagrangian drifters that have variable swimming velocity. We show that MORL is able to find a set of trade-off solutions forming an optimal Pareto frontier. As a benchmark, we show that a set of heuristic strategies is dominated by the MORL solutions. We consider the situation in which the agents cannot update their control variables continuously, but only after a discrete (decision) time, $\tau$. We show that there is a range of decision times, between the Lyapunov time and the continuous-updating limit, where reinforcement learning finds strategies that significantly improve over heuristics. In particular, we discuss how large decision times require enhanced knowledge of the flow, whereas for smaller $\tau$ all a priori heuristic strategies become Pareto optimal.
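To make the scalarization idea concrete, the following is a minimal, self-contained Python sketch of scalarized Q-learning for two competing objectives. It is not the authors' code: the 2D flow and the drifter pair are replaced by a toy random transition kernel, and all state/action sets, reward tables, and hyperparameters below are illustrative placeholders. Only the overall scheme follows the abstract: combine the two rewards with a weight w, run tabular Q-learning on the scalar reward, and sweep w to trace an approximate Pareto frontier.

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in environment (hypothetical): states are, e.g., discretised
# relative-position bins; actions are swimming-speed levels. The real flow
# dynamics are replaced by a random transition kernel P.
N_STATES, N_ACTIONS = 8, 3
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # P[s, a] -> next-state distribution
R_DISP = -rng.random((N_STATES, N_ACTIONS))   # objective 1: -(dispersion rate), to be maximised
R_COST = -np.linspace(0.0, 1.0, N_ACTIONS)    # objective 2: -(activation cost), grows with speed

def q_learning(w, episodes=500, horizon=50, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning on the scalarised reward r = w*r_disp + (1-w)*r_cost."""
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s = rng.integers(N_STATES)
        for _ in range(horizon):
            # epsilon-greedy action selection
            a = rng.integers(N_ACTIONS) if rng.random() < eps else int(Q[s].argmax())
            r = w * R_DISP[s, a] + (1 - w) * R_COST[a]   # linear scalarization
            s2 = rng.choice(N_STATES, p=P[s, a])
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            s = s2
    return Q

def evaluate(Q, episodes=200, horizon=50):
    """Average each objective separately under the greedy policy."""
    disp, cost = 0.0, 0.0
    for _ in range(episodes):
        s = rng.integers(N_STATES)
        for _ in range(horizon):
            a = int(Q[s].argmax())
            disp += R_DISP[s, a]
            cost += R_COST[a]
            s = rng.choice(N_STATES, p=P[s, a])
    n = episodes * horizon
    return disp / n, cost / n

# Sweep the scalarisation weight: each w yields one policy, and the
# non-dominated (dispersion, cost) pairs approximate the Pareto frontier.
for w in np.linspace(0.0, 1.0, 5):
    d, c = evaluate(q_learning(w))
    print(f"w={w:.2f}  avg -dispersion={d:+.3f}  avg -cost={c:+.3f}")

Each choice of w converts the multi-objective problem into an ordinary single-objective one, so any standard Q-learning routine applies unchanged; the trade-off structure emerges only when the resulting policies are compared across the sweep.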