IDSIA, Dalle Molle Institute for Artificial Intelligence, Università della Svizzera Italiana-Scuola Universitaria Professionale della Svizzera Italiana (USI-SUPSI) Lugano, Switzerland.
Front Psychol. 2013 Nov 26;4:833. doi: 10.3389/fpsyg.2013.00833. eCollection 2013.
A reinforcement learning agent that autonomously explores its environment can use a curiosity drive to enable continual learning of skills, in the absence of any external rewards. We formulate curiosity-driven exploration, and eventual skill acquisition, as a selective sampling problem. Each environment setting provides the agent with a stream of instances. An instance is a sensory observation that, when queried, causes an outcome that the agent is trying to predict. After an instance is observed, a query condition, derived herein, tells whether its outcome is statistically known or unknown to the agent, based on the confidence interval of an online linear classifier. Upon encountering the first unknown instance, the agent "queries" the environment to observe the outcome, which is expected to improve its confidence in the corresponding predictor. If the environment is in a setting where all instances are known, the agent generates a plan of actions to reach a new setting, where an unknown instance is likely to be encountered. The desired setting is a self-generated goal, and the plan of action, essentially a program to solve a problem, is a skill. The success of the plan depends on the quality of the agent's predictors, which are improved as mentioned above. For validation, this method is applied to both a simulated and a real Katana robot arm in its "blocks-world" environment. Results show that the proposed method generates sample-efficient curious exploration behavior, which exhibits developmental stages, continual learning, and skill acquisition, in an intrinsically motivated playful agent.
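The core mechanism of the abstract, a query condition that labels an instance's outcome "known" or "unknown" from the confidence interval of an online linear classifier, can be illustrated with a minimal sketch. This is not the paper's exact algorithm: the class name, the recursive-least-squares-style update, and the specific threshold rule (query when the margin is small relative to the estimate's uncertainty) are assumptions made for illustration, in the general spirit of confidence-based selective sampling.

```python
import numpy as np

class SelectiveSampler:
    """Hedged sketch of a confidence-based query condition for an
    online linear classifier. Names and update rule are illustrative
    assumptions, not the paper's exact derivation."""

    def __init__(self, dim, reg=1.0):
        self.w = np.zeros(dim)      # weight vector of the predictor
        self.A = reg * np.eye(dim)  # regularized correlation matrix

    def is_unknown(self, x, threshold=1.0):
        """True if the outcome of instance x is statistically 'unknown':
        the margin |w.x| is small relative to the uncertainty
        x^T A^{-1} x of the current estimate (a confidence interval
        around the prediction that still straddles zero)."""
        margin = abs(self.w @ x)
        uncertainty = x @ np.linalg.solve(self.A, x)
        return margin <= threshold * np.sqrt(uncertainty)

    def update(self, x, y):
        """Query step: observe outcome y in {-1, +1} for instance x and
        refine the estimate, shrinking future confidence intervals."""
        self.A += np.outer(x, x)
        self.w += np.linalg.solve(self.A, x * (y - self.w @ x))
```

Under this sketch, the agent would call `is_unknown` on each instance in the stream and issue a query (then `update`) only on the first instance whose confidence interval is too wide; repeated queries on similar instances shrink the interval until those instances count as known, prompting the agent to plan toward a new setting.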