A dual foveal-peripheral visual processing model implements efficient saccade selection.

Authors

Emmanuel Daucé, Pierre Albiges, Laurent U Perrinet

Affiliation

Institut de Neurosciences de la Timone (UMR 7289), Aix Marseille University, CNRS, Marseille, France.

Publication

J Vis. 2020 Aug 3;20(8):22. doi: 10.1167/jov.20.8.22.

Abstract

We develop a visuomotor model that implements visual search as a focal accuracy-seeking policy, with the target's position and category drawn independently from a common generative process. Consistent with the anatomical separation between the ventral and dorsal pathways, the model is composed of two pathways that respectively infer what to see and where to look. The "What" network is a classical deep learning classifier that only processes a small region around the center of fixation, providing a "foveal" accuracy. In contrast, the "Where" network processes the full visual field in a biomimetic fashion, using a log-polar retinotopic encoding, which is preserved up to the action selection level. In our model, the foveal accuracy is used as a monitoring signal to train the "Where" network, much as in the "actor/critic" framework. After training, the "Where" network provides an "accuracy map" that serves to guide the eye toward peripheral objects. Finally, comparing the two networks' accuracies amounts to either selecting a saccade or keeping the eye focused at the center to identify the target. We test this setup on a simple task of finding a digit in a large, cluttered image. Our simulation results demonstrate the effectiveness of this approach, increasing by one order of magnitude the radius of the visual field within which the agent can detect and recognize a target, either with a single saccade or with several. Importantly, our log-polar treatment of the visual information exploits the strong compression rate performed at the sensory level, providing a way to implement visual search in a sublinear fashion, in contrast with mainstream computer vision.
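
The saccade-or-fixate decision described in the abstract is simple enough to sketch in code. Below is a minimal NumPy sketch under stated assumptions, not the authors' implementation: `what_net` and `where_net` are hypothetical stubs standing in for the trained "What" and "Where" networks, and the log-polar sampler keeps a single pixel per grid node purely for illustration. Only the control flow follows the abstract: encode the periphery on a log-polar grid, predict an accuracy map over candidate saccade targets, then saccade to the best target unless the central (foveal) accuracy already wins.

```python
# Minimal sketch (NumPy only) of the dual-pathway saccade selection loop.
# `what_net` and `where_net` below are hypothetical placeholders for the
# trained "What" (foveal classifier) and "Where" (accuracy-map) networks.
import numpy as np

def log_polar_grid(n_ecc=10, n_theta=24, r_min=2.0, r_max=120.0):
    """Sampling centers on a log-polar grid: eccentricities grow
    geometrically, so the periphery is strongly compressed, which is what
    makes the search cost sublinear in the visual-field area."""
    radii = np.geomspace(r_min, r_max, n_ecc)
    thetas = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(radii, thetas, indexing="ij")
    return np.stack([rr * np.cos(tt), rr * np.sin(tt)], axis=-1)  # (n_ecc, n_theta, 2)

def sample_log_polar(image, fix, grid):
    """Crude retinotopic encoding: one pixel per grid node around the
    fixation point `fix` = (row, col); zero outside the frame."""
    h, w = image.shape
    pts = np.rint(grid + np.asarray(fix, dtype=float)).astype(int)
    r, c = pts[..., 0], pts[..., 1]
    valid = (r >= 0) & (r < h) & (c >= 0) & (c < w)
    out = np.zeros(grid.shape[:2])
    out[valid] = image[r[valid], c[valid]]
    return out

def saccade_step(image, fix, grid, what_net, where_net):
    """One decision step: classify at the current fixation, or saccade.

    what_net(fovea)     -> (label, foveal accuracy) for the central patch.
    where_net(encoding) -> predicted post-saccadic accuracy, one value
                           per log-polar node (the "accuracy map").
    """
    r0, c0 = fix
    fovea = image[r0 - 14:r0 + 14, c0 - 14:c0 + 14]    # 28x28 central crop
    label, central_acc = what_net(fovea)
    acc_map = where_net(sample_log_polar(image, fix, grid))
    best = np.unravel_index(np.argmax(acc_map), acc_map.shape)
    if central_acc >= acc_map[best]:
        return "classify", label, fix                   # keep fixating, answer now
    new_fix = tuple(np.rint(grid[best] + np.asarray(fix, dtype=float)).astype(int))
    return "saccade", None, new_fix                     # move the eye, re-evaluate

# Toy usage with random stubs in place of the trained networks:
rng = np.random.default_rng(0)
image = rng.random((256, 256))
grid = log_polar_grid()
what_net = lambda fovea: (7, float(fovea.mean()))       # stub classifier
where_net = lambda enc: rng.random(enc.shape)           # stub accuracy map
print(saccade_step(image, (128, 128), grid, what_net, where_net))
```

In a full loop one would iterate `saccade_step` until it returns "classify" (or a step budget runs out). Training the "Where" network then amounts to regressing, for each node, the foveal accuracy the "What" network would obtain after saccading there, which is the actor/critic-style monitoring signal the abstract describes.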

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3e79/7443118/91207b250347/jovi-20-8-22-f001.jpg
