Usaha Wipawee, Barria Javier A
School of Telecommunication Engineering, Suranaree University of Technology, Nakorn Ratchasima 30000, Thailand.
IEEE Trans Syst Man Cybern B Cybern. 2007 Jun;37(3):515-27. doi: 10.1109/tsmcb.2006.886173.
In this paper, we develop and assess online decision-making algorithms for call admission and routing for low Earth orbit (LEO) satellite networks. It has been shown in a recent paper that, in a LEO satellite system, a semi-Markov decision process formulation of the call admission and routing problem can achieve better performance in terms of an average revenue function than existing routing methods. However, the conventional dynamic programming (DP) numerical solution becomes prohibitive as the problem size increases. In this paper, two solution methods based on reinforcement learning (RL) are proposed in order to circumvent the computational burden of DP. The first method is based on an actor-critic method with temporal-difference (TD) learning. The second method is based on a critic-only method, called optimistic TD learning. The algorithms enhance performance in terms of storage requirements, computational complexity, and computation time, and in terms of an overall long-term average revenue function that penalizes blocked calls. Numerical studies are carried out, and the results obtained show that the RL framework can achieve up to 56% higher average revenue than existing routing methods used in LEO satellite networks, with reasonable storage and computational requirements.
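To make the critic idea concrete, the sketch below shows a minimal average-reward TD(0) value update of the general kind that TD-learning critics for semi-Markov decision problems build on. This is an illustration only, not the paper's algorithm: the toy link-occupancy state space, the revenue values, the random placeholder policy, and all variable names are hypothetical assumptions.

```python
# Minimal average-reward TD(0) sketch for a toy call-admission chain.
# Hypothetical illustration; not the authors' actor-critic or optimistic
# TD algorithm. States abstract link occupancy; accepting a call earns
# revenue, blocked/idle transitions earn nothing (the penalized outcome).
import random

random.seed(0)

N_STATES = 4              # toy link-occupancy levels (hypothetical)
alpha, beta = 0.1, 0.01   # step sizes for values and average-reward estimate
V = [0.0] * N_STATES      # critic's state-value estimates
rho = 0.0                 # running estimate of long-term average revenue

def step(state, accept):
    """Hypothetical environment transition with its immediate revenue."""
    if accept and state < N_STATES - 1:
        return state + 1, 1.0    # carried call: occupancy up, unit revenue
    if not accept and state > 0:
        return state - 1, 0.0    # call departs; no revenue
    return state, 0.0            # blocked or idle: no revenue

state = 0
for _ in range(10_000):
    accept = random.random() < 0.5       # placeholder policy (actor not shown)
    nxt, r = step(state, accept)
    # Average-reward TD(0): the error uses (r - rho) in place of discounting
    delta = r - rho + V[nxt] - V[state]
    V[state] += alpha * delta            # critic update
    rho += beta * delta                  # average-revenue update
    state = nxt

print(round(rho, 2))   # rho tracks the policy's average revenue per step
```

Under this random policy, `rho` settles near the empirical average revenue per step; in the paper's setting the actor (or an optimistic action selection) would instead use the critic's values to steer admission and routing toward higher `rho`.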