California Institute of Technology, Pasadena, USA.
Sci Rep. 2024 Mar 21;14(1):6858. doi: 10.1038/s41598-024-56828-2.
The ability to understand and manipulate numbers and quantities emerges during childhood, but the mechanism through which humans acquire and develop this ability is still poorly understood. We explore this question through a model, assuming that the learner is able to pick up small objects from, and place them at, locations of its choosing, and will spontaneously engage in such undirected manipulation. We further assume that the learner's visual system will monitor the changing arrangements of objects in the scene and will learn to predict the effects of each action by comparing perception with a supervisory signal from the motor system. We model perception using standard deep networks for feature extraction and classification. Our main finding is that, from learning the task of action prediction, an unexpected image representation emerges exhibiting regularities that foreshadow the perception and representation of numbers and quantity. These include distinct categories for zero and the first few natural numbers, a strict ordering of the numbers, and a one-dimensional signal that correlates with numerical quantity. As a result, our model acquires the ability to estimate numerosity, i.e., the number of objects in the scene, as well as the ability to subitize, i.e., to recognize at a glance the exact number of objects in small scenes. Remarkably, subitization and numerosity estimation extrapolate to scenes containing many objects, far beyond the three objects used during training. We conclude that important aspects of a facility with numbers and quantities may be learned with supervision from a simple pre-training task. Our observations suggest that cross-modal learning is a powerful learning mechanism that may be harnessed in artificial intelligence.
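The action-prediction task described in the abstract can be illustrated with a minimal Python sketch of how training examples might be generated. This is not the authors' code: the grid size, the label names "pick" and "place", and the helpers `random_scene`, `apply_action`, and `sample_pair` are assumptions made for illustration. Only the limit of three objects during training is taken from the abstract. In the setup described there, a deep network would receive images of the scene before and after the action and be trained to classify which action occurred; that network is not shown here.

```python
import random

ACTIONS = ("pick", "place")  # hypothetical label names for the two manipulations
MAX_OBJECTS = 3              # the abstract states that training used at most three objects
GRID = 8                     # assumed scene size, chosen only for illustration


def random_scene(n, rng):
    """Return a set of n distinct (row, col) object positions."""
    cells = [(r, c) for r in range(GRID) for c in range(GRID)]
    return set(rng.sample(cells, n))


def apply_action(scene, action, rng):
    """Return a new scene after picking up or placing exactly one object."""
    after = set(scene)
    if action == "pick":
        # Remove one existing object, chosen at random.
        after.remove(rng.choice(sorted(after)))
    else:
        # Place one new object on a randomly chosen free cell.
        free = [(r, c) for r in range(GRID) for c in range(GRID)
                if (r, c) not in after]
        after.add(rng.choice(free))
    return after


def sample_pair(rng):
    """Sample one (before, after, action) example for action prediction."""
    n = rng.randint(0, MAX_OBJECTS)
    # Only actions that keep the object count within [0, MAX_OBJECTS] are legal.
    legal = [a for a in ACTIONS
             if (a == "pick" and n > 0) or (a == "place" and n < MAX_OBJECTS)]
    action = rng.choice(legal)
    before = random_scene(n, rng)
    return before, apply_action(before, action, rng), action


rng = random.Random(0)
dataset = [sample_pair(rng) for _ in range(1000)]
```

The action label plays the role of the motor system's supervisory signal. No object count is ever given to the learner, which is why a number-like representation would be an emergent by-product rather than a training target.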