Ben-Yosef Guy, Ullman Shimon
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
Interface Focus. 2018 Aug 6;8(4):20180020. doi: 10.1098/rsfs.2018.0020. Epub 2018 Jun 15.
Computational models of vision have advanced rapidly in recent years, rivalling human-level performance in some areas. Much of the progress to date has focused on analysing the visual scene at the object level: the recognition and localization of objects in the scene. Human understanding of images is richer and deeper, extending both 'below' the object level, such as identifying and localizing object parts and sub-parts, and 'above' the object level, such as identifying object relations, and agents with their actions and interactions. In both cases, understanding depends on recovering meaningful structures in the image, together with their components, properties and inter-relations, a process referred to here as 'image interpretation'. In this paper, we describe recent directions, based on human and computer vision studies, towards human-like image interpretation that is beyond the reach of current schemes. We consider interpretation below the object level, as well as some aspects of image interpretation at the level of meaningful configurations beyond the recognition of individual objects, in particular interactions between two people in close contact. In both cases, the recognition process depends on the detailed interpretation of so-called 'minimal images', and at both levels recognition depends on combining 'bottom-up' processing, proceeding from lower to higher levels of a processing hierarchy, with 'top-down' processing, proceeding from higher to lower stages of visual analysis.