Liang Enming, Wen Kexin, Lam William H K, Sumalee Agachai, Zhong Renxin
IEEE Trans Neural Netw Learn Syst. 2022 Sep;33(9):4742-4756. doi: 10.1109/TNNLS.2021.3060187. Epub 2022 Aug 31.
Balancing supply and demand for ride-sourcing companies is challenging, especially with real-time requests and stochastic traffic conditions on large-scale congested road networks. To tackle this challenge, this article proposes a robust and scalable approach that integrates reinforcement learning (RL) with a centralized programming (CP) structure to support real-time taxi operations. Both real-time order-matching decisions and vehicle relocation decisions at the microscopic network scale are integrated within a Markov decision process framework. The RL component learns the decomposed state-value function, which represents the taxi drivers' experience, the off-line historical demand pattern, and the traffic network congestion. The CP component plans nonmyopic decisions for drivers collectively under the prescribed system constraints to explicitly realize cooperation. Furthermore, to circumvent sparse reward and sample imbalance problems over the microscopic road network, this article proposes a temporal-difference learning algorithm with prioritized gradient descent and adaptive exploration techniques. A simulator is built and trained with the Manhattan road network and New York City yellow taxi data to simulate the real-time vehicle dispatching environment. Both centralized and decentralized taxi dispatching policies are examined with the simulator. The case study shows that the proposed approach can further improve taxi drivers' profits while reducing customers' waiting times compared with several existing vehicle dispatching algorithms.
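The interplay between a learned state-value function and a centralized matching step can be illustrated with a toy sketch. This is not the authors' algorithm: it uses plain tabular TD(0) rather than the prioritized gradient descent and adaptive exploration described in the abstract, and it brute-forces the assignment instead of solving a program under system constraints. The zone labels, the `pickup_cost` table, the score formula, and the values of `GAMMA` and `ALPHA` are all assumptions made for illustration.

```python
from itertools import permutations

GAMMA = 0.9  # discount factor (assumed, not taken from the paper)
ALPHA = 0.5  # learning rate (assumed, not taken from the paper)


def td0_update(values, state, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """Apply one tabular TD(0) update to values[state]; return the TD error."""
    v_state = values.get(state, 0.0)
    td_error = reward + gamma * values.get(next_state, 0.0) - v_state
    values[state] = v_state + alpha * td_error
    return td_error


def match_score(values, pickup_cost, driver_zone, order, gamma=GAMMA):
    """Hypothetical pair score: fare minus pickup cost, plus the discounted
    value of the drop-off zone, minus the value of the zone the driver leaves."""
    return (
        order["fare"]
        - pickup_cost[(driver_zone, order["pickup"])]
        + gamma * values.get(order["dropoff"], 0.0)
        - values.get(driver_zone, 0.0)
    )


def centralized_match(values, pickup_cost, driver_zones, orders, gamma=GAMMA):
    """Enumerate all one-to-one driver/order assignments and keep the one with
    the highest total score. Exponential cost: only for tiny examples."""
    n_drivers, n_orders = len(driver_zones), len(orders)
    if n_drivers >= n_orders:
        # Every order gets a distinct driver.
        candidates = (
            list(zip(p, range(n_orders)))
            for p in permutations(range(n_drivers), n_orders)
        )
    else:
        # Every driver gets a distinct order.
        candidates = (
            list(zip(range(n_drivers), p))
            for p in permutations(range(n_orders), n_drivers)
        )
    best_total, best_pairs = float("-inf"), []
    for pairs in candidates:
        total = sum(
            match_score(values, pickup_cost, driver_zones[d], orders[o], gamma)
            for d, o in pairs
        )
        if total > best_total:
            best_total, best_pairs = total, sorted(pairs)
    return best_pairs, best_total
```

Calling `centralized_match` with a value table, a pickup-cost table keyed by `(driver_zone, pickup_zone)`, a list of driver zones, and a list of order dictionaries returns the chosen `(driver_index, order_index)` pairs and their total score. Because the score includes the learned value of the drop-off zone, the assignment is not purely myopic, which is the intuition behind coupling the two components. At realistic scale the enumeration would have to be replaced by a proper assignment or linear-programming solver.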