Multi-Armed Bandit-Based User Network Node Selection

Authors

Gao Qinyan, Xie Zhidong

Affiliations

National Innovation Institute of Defense Technology, Academy of Military Science, Beijing 100010, China.

Intelligent Game and Decision Laboratory, Academy of Military Science, Beijing 100091, China.

Publication information

Sensors (Basel). 2024 Jun 24;24(13):4104. doi: 10.3390/s24134104.

DOI:10.3390/s24134104

PMID:39000883

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11244562/

Abstract

In an integrated space-air-ground emergency communication network, users must quickly identify the optimal network node while network states are uncertain and fluctuate stochastically. This study models the problem as a Multi-Armed Bandit (MAB) and proposes an optimization algorithm based on dynamic variance sampling (DVS). The algorithm assumes a normal prior distribution for each node's network state and constructs that distribution's expected value and variance so as to make full use of the sample data, thereby balancing exploitation of existing data against exploration of the unknown. A theoretical proof shows that the algorithm's Bayesian regret grows sublinearly. Simulations show that the algorithm outperforms the traditional ε-greedy, Upper Confidence Bound (UCB), and Thompson sampling algorithms, achieving higher cumulative reward, lower total regret, faster convergence, and higher system throughput.
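
As a point of reference for the setting the abstract describes, the following is a minimal sketch of standard Gaussian Thompson sampling, one of the baselines the abstract names. It is not the authors' DVS algorithm, whose variance construction is not given in this record. The function name `gaussian_thompson`, its parameters, the known-noise assumption, and the example node means are all hypothetical choices made for illustration.

```python
import math
import random


def gaussian_thompson(true_means, horizon, noise_sd=1.0,
                      prior_mean=0.0, prior_var=1.0, seed=0):
    """Standard Gaussian Thompson sampling (illustrative baseline, not DVS).

    Each arm stands for a network node. Rewards are assumed to be normally
    distributed around an unknown mean with known standard deviation noise_sd.
    Returns the pull count per arm and the cumulative (pseudo-)regret.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k      # number of times each node was selected
    sums = [0.0] * k      # sum of rewards observed per node
    best = max(true_means)
    regret = 0.0

    for _ in range(horizon):
        samples = []
        for i in range(k):
            # Conjugate normal update: posterior precision is prior precision
            # plus one noise precision per observation of this arm.
            precision = 1.0 / prior_var + counts[i] / noise_sd ** 2
            post_var = 1.0 / precision
            post_mean = post_var * (prior_mean / prior_var
                                    + sums[i] / noise_sd ** 2)
            samples.append(rng.gauss(post_mean, math.sqrt(post_var)))

        # Play the arm whose posterior sample is largest.
        arm = max(range(k), key=samples.__getitem__)
        reward = rng.gauss(true_means[arm], noise_sd)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - true_means[arm]

    return counts, regret


if __name__ == "__main__":
    # Hypothetical mean rewards for three candidate nodes.
    counts, regret = gaussian_thompson([0.2, 0.5, 0.9], horizon=2000)
    print("pulls per node:", counts)
    print("cumulative regret:", round(regret, 2))
```

The posterior variance of each arm shrinks as that arm is sampled, so rarely tried nodes keep being explored while well-estimated ones are exploited. That is the exploration-exploitation balance the abstract refers to, here in its textbook form.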
