Van de Maele Toon, Verbelen Tim, Çatal Ozan, Dhoedt Bart
IDLab, Department of Information Technology, Ghent University - imec, Ghent, Belgium.
Front Neurorobot. 2022 Apr 14;16:840658. doi: 10.3389/fnbot.2022.840658. eCollection 2022.
Scene understanding and decomposition is a crucial challenge for intelligent systems, whether for object manipulation, navigation, or any other task. Although current machine and deep learning approaches for object detection and classification obtain high accuracy, they typically do not leverage interaction with the world and are limited to the set of objects seen during training. Humans, on the other hand, learn to recognize and classify different objects by actively engaging with them on first encounter. Moreover, recent theories in neuroscience suggest that cortical columns in the neocortex play an important role in this process by building predictive models of objects in their own reference frame. In this article, we present an enactive embodied agent that implements such a generative model for object interaction. For each object category, our system instantiates a deep neural network, called a Cortical Column Network (CCN), that represents the object in its own reference frame by learning a generative model that predicts the expected transform in pixel space, given an action. The model parameters are optimized through the active inference paradigm, i.e., the minimization of variational free energy. When provided with a visual observation, an ensemble of CCNs each vote on their belief of observing that specific object category, yielding a potential object classification. If the likelihood of the selected category is too low, the object is detected as an unknown category, and the agent can instantiate a novel CCN for this category. We validate our system in a simulated environment, where it needs to learn to discern multiple objects from the YCB dataset. We show that classification accuracy improves as the embodied agent gathers more evidence, and that it is able to learn about novel, previously unseen objects. Finally, we show that an agent driven by active inference can choose its actions to reach a preferred observation.
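To make the ensemble-voting and novel-category mechanism described in the abstract concrete, the sketch below illustrates the idea in Python. It is a minimal toy, not the authors' implementation: the class names (CorticalColumnNetwork, CCNEnsemble), the Gaussian log-evidence stand-in for the learned generative model, and the unknown_threshold value are all assumptions chosen only to show how per-category votes can yield a classification or trigger instantiation of a new CCN when evidence is too low.

```python
import numpy as np


class CorticalColumnNetwork:
    """Toy stand-in for a per-category CCN (hypothetical): scores how well an
    observation matches this category, as a proxy for model evidence."""

    def __init__(self, category, prototype):
        self.category = category
        self.prototype = np.asarray(prototype, dtype=float)

    def log_evidence(self, observation):
        # Higher is better; a simple Gaussian log-likelihood around a prototype
        # stands in for the learned generative model's evidence here.
        diff = np.asarray(observation, dtype=float) - self.prototype
        return -0.5 * float(diff @ diff)


class CCNEnsemble:
    """Ensemble of per-category CCNs that vote on the observed object category.
    If the best vote falls below a threshold, the object is treated as unknown
    and a new CCN is instantiated for it (assumed behavior, per the abstract)."""

    def __init__(self, unknown_threshold=-5.0):
        self.ccns = []
        self.unknown_threshold = unknown_threshold

    def classify(self, observation):
        # Each CCN votes with its evidence; the best-scoring category wins.
        if not self.ccns:
            return None, float("-inf")
        votes = [(c.category, c.log_evidence(observation)) for c in self.ccns]
        return max(votes, key=lambda v: v[1])

    def observe(self, observation):
        category, score = self.classify(observation)
        if category is None or score < self.unknown_threshold:
            # Low evidence on all known categories: instantiate a novel CCN.
            category = f"object_{len(self.ccns)}"
            self.ccns.append(CorticalColumnNetwork(category, observation))
        return category


if __name__ == "__main__":
    ensemble = CCNEnsemble(unknown_threshold=-5.0)
    print(ensemble.observe([0.0, 0.0]))   # unknown -> new CCN "object_0"
    print(ensemble.observe([0.1, -0.1]))  # close to object_0 -> classified as it
    print(ensemble.observe([8.0, 8.0]))   # far from all -> new CCN "object_1"
```

In the paper, the evidence each CCN contributes comes from its learned generative model (via variational free energy) rather than a fixed prototype, and the agent can additionally act to gather further evidence before committing to a classification.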