Department of Computer Science, School of Mathematics, Statistics, and Computer Science, University of Tehran, Tehran, Iran.
CERCO UMR 5549, CNRS - Université de Toulouse, F-31300, France.
Sci Rep. 2016 Sep 7;6:32672. doi: 10.1038/srep32672.
Deep convolutional neural networks (DCNNs) have attracted much attention recently, and have been shown to be able to recognize thousands of object categories in natural image databases. Their architecture is somewhat similar to that of the human visual system: both use restricted receptive fields and a hierarchy of layers that progressively extract increasingly abstract features. Yet it is unknown whether DCNNs match human performance at the task of view-invariant object recognition, whether they make similar errors and use similar representations for this task, and whether the answers depend on the magnitude of the viewpoint variations. To investigate these issues, we benchmarked eight state-of-the-art DCNNs, the HMAX model, and a baseline shallow model, and compared their results to those of humans tested with backward masking. Unlike in all previous DCNN studies, we carefully controlled the magnitude of the viewpoint variations to demonstrate that shallow nets can outperform deep nets and humans when variations are weak. When facing larger variations, however, more layers were needed to match human performance and error distributions, and to have representations that are consistent with human behavior. A very deep net with 18 layers even outperformed humans at the highest variation level, using the most human-like representations.