Department of Electrical and Computer Engineering, University of California, Los Angeles, CA 90095.
Department of Electrical Engineering, Stanford University, Stanford, CA 94305.
Proc Natl Acad Sci U S A. 2019 Jan 2;116(1):96-105. doi: 10.1073/pnas.1802103115. Epub 2018 Dec 17.
Despite significant recent progress, machine vision systems lag considerably behind their biological counterparts in performance, scalability, and robustness. A distinctive hallmark of the brain is its ability to automatically discover and model objects, at multiscale resolutions, from repeated exposures to unlabeled contextual data, and then to robustly detect the learned objects under various nonideal circumstances, such as partial occlusion and different view angles. Replication of such capabilities in a machine would require three key ingredients: (i) access to large-scale perceptual data of the kind that humans experience, (ii) flexible representations of objects, and (iii) an efficient unsupervised learning algorithm. The Internet fortunately provides unprecedented access to vast amounts of visual data. This paper leverages the availability of such data to develop a scalable framework for unsupervised learning of object prototypes: brain-inspired, flexible, scale- and shift-invariant representations of deformable objects (e.g., humans, motorcycles, cars, airplanes) comprising parts, their different configurations and views, and their spatial relationships. Computationally, the object prototypes are represented as geometric associative networks using probabilistic constructs such as Markov random fields. We apply our framework to various datasets and show that our approach is computationally scalable and can construct accurate and operational part-aware object models far more efficiently than much of the recent computer vision literature. We also present efficient algorithms for detection and localization of objects and their partial views in new scenes.
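To make the prototype representation concrete, the minimal Python sketch below models an object prototype as a pairwise Markov random field whose nodes are parts and whose edges carry a preferred relative displacement with a Gaussian deformation cost. The class name, the Gaussian pairwise potential, and the two-part toy example are illustrative assumptions on our part, not the authors' implementation.

    import numpy as np

    # A minimal sketch (our illustration, not the paper's code): an object
    # prototype as a pairwise Markov random field. Nodes are parts; each edge
    # stores a preferred relative displacement and an inverse covariance that
    # penalizes deviations from it (a Gaussian deformation cost).
    class ObjectPrototype:
        def __init__(self, parts, edges):
            self.parts = parts  # part names, e.g. ["front_wheel", "rear_wheel"]
            self.edges = edges  # {(i, j): (mean_offset, inverse_covariance)}

        def pairwise_energy(self, i, j, pos_i, pos_j):
            """Deformation cost for the relative placement of parts i and j."""
            mean, inv_cov = self.edges[(i, j)]
            d = (pos_j - pos_i) - mean
            return 0.5 * d @ inv_cov @ d

        def energy(self, positions, unary):
            """Total MRF energy: part appearance costs plus deformation costs."""
            e = sum(unary)
            e += sum(self.pairwise_energy(i, j, positions[i], positions[j])
                     for (i, j) in self.edges)
            return e

    # Toy usage: a two-part prototype whose parts prefer a horizontal offset.
    proto = ObjectPrototype(
        parts=["front_wheel", "rear_wheel"],
        edges={(0, 1): (np.array([2.0, 0.0]), np.eye(2))},
    )
    print(proto.energy([np.array([0.0, 0.0]), np.array([2.1, 0.1])],
                       unary=[-1.0, -0.8]))

Under this kind of energy, detection in a new scene amounts to finding part placements that minimize the combined appearance and deformation cost, which is what makes the representation naturally tolerant of partial occlusion and moderate viewpoint change.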