

A learning-based synthesis approach of reward asynchronous probabilistic games against the linear temporal logic winning condition.

Authors

Zhao Wei, Liu Zhiming

Affiliations

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu, China.

School of Software, Northwestern Polytechnical University, Xi'an, Shaanxi, China.

Publication

PeerJ Comput Sci. 2022 Sep 5;8:e1094. doi: 10.7717/peerj-cs.1094. eCollection 2022.

DOI: 10.7717/peerj-cs.1094
PMID: 36091983
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9455281/
Abstract

The traditional synthesis problem is usually solved by constructing a system that fulfills given specifications. The system is constantly interacting with the environment and is opposed to the environment. The problem can be further regarded as solving a two-player game (the system and its environment). Meanwhile, stochastic games are often used to model reactive processes. With the development of the intelligent industry, these theories are extensively used in robot patrolling, intelligent logistics, and intelligent transportation. However, it is still challenging to find a practically feasible synthesis algorithm and generate the optimal system according to the existing research. Thus, it is desirable to design an incentive mechanism to motivate the system to fulfill given specifications. This work studies the learning-based approach for strategy synthesis of reward asynchronous probabilistic games against linear temporal logic (LTL) specifications in a probabilistic environment. An asynchronous reward mechanism is proposed to motivate players to gain maximized rewards by their positions and choose actions. Based on this mechanism, the techniques of the learning theory can be applied to transform the synthesis problem into the problem of computing the expected rewards. Then, it is proven that the reinforcement learning algorithm provides the optimal strategies that maximize the expected cumulative reward of the satisfaction of an LTL specification asymptotically. Finally, our techniques are implemented, and their effectiveness is illustrated by two case studies of robot patrolling and autonomous driving.
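The abstract's central reduction is that, once an asynchronous reward mechanism ties rewards to satisfying the LTL objective, synthesis becomes a problem of maximizing expected cumulative reward, which standard reinforcement learning can solve. The following is a minimal toy sketch of that idea, not the paper's algorithm: it runs tabular Q-learning on a small hypothetical product MDP in which reward 1 is granted only on reaching an assumed "accepting" state of the LTL product construction, under stochastic transitions that stand in for the probabilistic environment. All states, probabilities, and parameters here are illustrative assumptions.

```python
import random

# Toy sketch (illustrative, NOT the paper's method): tabular Q-learning
# on a tiny product MDP. Reward 1 is given only on entering ACCEPT, an
# assumed accepting state of an LTL product automaton, so maximizing
# expected cumulative reward pushes the policy toward satisfying the
# LTL objective.
N_STATES, N_ACTIONS, ACCEPT = 4, 2, 3
GAMMA, ALPHA, EPS = 0.95, 0.2, 0.1

def step(s, a, rng):
    # Action 1 tries to advance toward ACCEPT and succeeds w.p. 0.8;
    # otherwise the stochastic environment leaves the state unchanged.
    if a == 1 and rng.random() < 0.8:
        s = min(ACCEPT, s + 1)
    reward = 1.0 if s == ACCEPT else 0.0
    return s, reward, s == ACCEPT

def q_learn(episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy exploration over the current Q-estimates.
            if rng.random() < EPS:
                a = rng.randrange(N_ACTIONS)
            else:
                a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
            s2, r, done = step(s, a, rng)
            # Standard Q-learning update toward the one-step target.
            Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learn()
# Greedy policy extracted from the learned values: in every
# non-accepting state it should prefer the "advance" action 1.
policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a])
          for s in range(N_STATES)]
```

In the paper's setting the reward signal would instead come from the proposed asynchronous reward mechanism over the game positions, but the shape of the learning loop, estimating expected rewards and extracting a maximizing strategy, is the same.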


Figures:
Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e828/9455281/a34ed10603ed/peerj-cs-08-1094-g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e828/9455281/1ea41774c0a9/peerj-cs-08-1094-g002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e828/9455281/922c8bfe8961/peerj-cs-08-1094-g003.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e828/9455281/6f444b801c0e/peerj-cs-08-1094-g004.jpg

Similar Articles

1. A learning-based synthesis approach of reward asynchronous probabilistic games against the linear temporal logic winning condition. PeerJ Comput Sci. 2022 Sep 5;8:e1094. doi: 10.7717/peerj-cs.1094. eCollection 2022.
2. Safe reinforcement learning under temporal logic with reward design and quantum action selection. Sci Rep. 2023 Feb 2;13(1):1925. doi: 10.1038/s41598-023-28582-4.
3. A formal methods approach to interpretable reinforcement learning for robotic planning. Sci Robot. 2019 Dec 18;4(37). doi: 10.1126/scirobotics.aay6276.
4. Momentary subjective well-being depends on learning and not reward. Elife. 2020 Nov 17;9:e57977. doi: 10.7554/eLife.57977.
5. Mobile Robot Networks for Environmental Monitoring: A Cooperative Receding Horizon Temporal Logic Control Approach. IEEE Trans Cybern. 2019 Feb;49(2):698-711. doi: 10.1109/TCYB.2018.2879905. Epub 2018 Nov 19.
6. An approach to solving optimal control problems of nonlinear systems by introducing detail-reward mechanism in deep reinforcement learning. Math Biosci Eng. 2022 Jun 23;19(9):9258-9290. doi: 10.3934/mbe.2022430.
7. Safe Decision Controller for Autonomous Driving Based on Deep Reinforcement Learning in Nondeterministic Environment. Sensors (Basel). 2023 Jan 20;23(3):1198. doi: 10.3390/s23031198.
8. A Collaborative Multiagent Reinforcement Learning Method Based on Policy Gradient Potential. IEEE Trans Cybern. 2021 Feb;51(2):1015-1027. doi: 10.1109/TCYB.2019.2932203. Epub 2021 Jan 15.
9. Optimal Policy of Multiplayer Poker via Actor-Critic Reinforcement Learning. Entropy (Basel). 2022 May 30;24(6):774. doi: 10.3390/e24060774.
10. Learning to maximize reward rate: a model based on semi-Markov decision processes. Front Neurosci. 2014 May 23;8:101. doi: 10.3389/fnins.2014.00101. eCollection 2014.

References Cited in This Article

1. Stochastic Games. Proc Natl Acad Sci U S A. 1953 Oct;39(10):1095-100. doi: 10.1073/pnas.39.10.1095.