Pylyshyn Z W
Rutgers Center for Cognitive Science, Rutgers University, Psychology Building, New Wing, Busch Campus, New Brunswick, NJ 08903, USA.
Cognition. 2001 Jun;80(1-2):127-58. doi: 10.1016/s0010-0277(00)00156-6.
This paper argues that a theory of situated vision, suited for the dual purposes of object recognition and the control of action, will have to provide something more than a system that constructs a conceptual representation from visual stimuli: it will also need to provide a special kind of direct (preconceptual, unmediated) connection between elements of a visual representation and certain elements in the world. Like natural language demonstratives (such as 'this' or 'that') this direct connection allows entities to be referred to without being categorized or conceptualized. Several reasons are given for why we need such a preconceptual mechanism which individuates and keeps track of several individual objects in the world. One is that early vision must pick out and compute the relation among several individual objects while ignoring their properties. Another is that incrementally computing and updating representations of a dynamic scene requires keeping track of token individuals despite changes in their properties or locations. It is then noted that a mechanism meeting these requirements has already been proposed in order to account for a number of disparate empirical phenomena, including subitizing, search-subset selection and multiple object tracking (Pylyshyn et al., Canadian Journal of Experimental Psychology 48(2) (1994) 260). This mechanism, called a visual index or FINST, is briefly discussed and it is argued that viewing it as performing a demonstrative or preconceptual reference function has far-reaching implications not only for a theory of situated vision, but also for suggesting a new way to look at why the primitive individuation of visual objects, or proto-objects, is so central in computing visual representations. 
Indexing visual objects is also, according to this view, the primary means for grounding visual concepts and is a potentially fruitful way to look at the problem of visual integration across time and across saccades, as well as to explain how infants' numerical capacity might arise.