Huang Hanchi, Shen Li, Ye Deheng, Liu Wei
IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):17608-17619. doi: 10.1109/TNNLS.2023.3306801. Epub 2024 Dec 2.
We propose a novel master-slave architecture for the top-K combinatorial multiarmed bandit (CMAB) problem with nonlinear bandit feedback and diversity constraints, which, to the best of our knowledge, is the first combinatorial bandit setting to consider diversity constraints under bandit feedback. Specifically, to explore the combinatorial and constrained action space efficiently, we introduce six slave models with distinct strengths that generate diversified samples balancing rewards, constraints, and efficiency. Moreover, we propose teacher learning-based optimization and a policy cotraining technique to boost the performance of the multiple slave models. The master model then collects the elite samples provided by the slave models and selects the best one, as estimated by a neural contextual UCB-based network (NeuralUCB), to trade off exploration and exploitation. Thanks to the elaborate design of the slave models, the cotraining mechanism among them, and the novel interactions between the master and slave models, our approach significantly surpasses existing state-of-the-art algorithms on both synthetic and real datasets for recommendation tasks. The code is available at https://github.com/huanghanchi/Master-slave-Algorithm-for-Top-K-Bandits.
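The master-slave loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a linear UCB scorer stands in for NeuralUCB, two toy samplers stand in for the six slave models, and the diversity constraint (at most `max_per_group` arms per group) is an assumption made for illustration. All names (`LinearUCBMaster`, `random_slave`, `greedy_slave`, `diversity_ok`, `run_demo`) are hypothetical.

```python
import numpy as np

def diversity_ok(subset, groups, max_per_group):
    """Assumed constraint: no group contributes more than max_per_group arms."""
    counts = np.bincount(groups[subset], minlength=int(groups.max()) + 1)
    return bool(np.all(counts <= max_per_group))

def featurize(subset, n_arms):
    """Encode a chosen subset of arms as a binary indicator vector."""
    x = np.zeros(n_arms)
    x[list(subset)] = 1.0
    return x

class LinearUCBMaster:
    """Linear UCB scorer, used here as a simple stand-in for NeuralUCB."""
    def __init__(self, dim, beta=1.0, lam=1.0):
        self.A = lam * np.eye(dim)   # regularized design matrix
        self.b = np.zeros(dim)       # reward-weighted feature sum
        self.beta = beta             # exploration strength

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        # estimated reward plus an uncertainty bonus
        return float(theta @ x + self.beta * np.sqrt(x @ A_inv @ x))

    def select(self, candidates):
        return int(np.argmax([self.score(x) for x in candidates]))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def random_slave(rng, n_arms, k, groups, max_per_group):
    """Toy slave: rejection-sample a feasible size-k subset."""
    while True:
        subset = rng.choice(n_arms, size=k, replace=False)
        if diversity_ok(subset, groups, max_per_group):
            return np.sort(subset)

def greedy_slave(theta, k, groups, max_per_group):
    """Toy slave: take the highest-estimate arms that keep the subset feasible."""
    chosen, counts = [], {}
    for arm in np.argsort(-theta):
        g = int(groups[arm])
        if counts.get(g, 0) < max_per_group:
            chosen.append(int(arm))
            counts[g] = counts.get(g, 0) + 1
        if len(chosen) == k:
            break
    return np.array(sorted(chosen))

def run_demo(seed=0, rounds=200):
    rng = np.random.default_rng(seed)
    n_arms, k = 8, 3
    groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
    true_w = rng.uniform(0, 1, n_arms)        # hidden per-arm weights
    master = LinearUCBMaster(n_arms)
    total = 0.0
    for _ in range(rounds):
        theta = np.linalg.solve(master.A, master.b)
        # slaves propose feasible candidate subsets
        subsets = [random_slave(rng, n_arms, k, groups, 1) for _ in range(4)]
        subsets.append(greedy_slave(theta, k, groups, 1))
        feats = [featurize(s, n_arms) for s in subsets]
        i = master.select(feats)              # master picks by UCB score
        # nonlinear bandit feedback: one scalar reward for the whole subset
        reward = np.tanh(true_w[subsets[i]].sum()) + 0.05 * rng.normal()
        master.update(feats[i], reward)
        total += reward
    return total / rounds

if __name__ == "__main__":
    print(run_demo())
```

In this sketch the slaves only ever propose subsets that satisfy the constraint, so the master's job reduces to ranking feasible candidates by an optimistic reward estimate and learning from the single scalar reward it observes.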