Zerroug Aimen, Vaishnav Mohit, Colin Julien, Musslick Sebastian, Serre Thomas
Artificial and Natural Intelligence Toulouse Institute, Université de Toulouse, France.
Carney Institute for Brain Science, Dept. of Cognitive, Linguistic & Psychological Sciences, Brown University, Providence, RI 02912.
Adv Neural Inf Process Syst. 2022 Dec;35(DB):29776-29788.
A fundamental component of human vision is our ability to parse complex visual scenes and judge the relations between their constituent objects. AI benchmarks for visual reasoning have driven rapid progress in recent years, with state-of-the-art systems now reaching human accuracy on some of these benchmarks. Yet, there remains a major gap between humans and AI systems in terms of the sample efficiency with which they learn new visual reasoning tasks. Humans' remarkable efficiency at learning has been at least partially attributed to their ability to harness compositionality, allowing them to efficiently take advantage of previously gained knowledge when learning new tasks. Here, we introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards the development of more data-efficient learning algorithms. We take inspiration from fluid intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abstract rules and generating image datasets corresponding to these rules at scale. Our proposed benchmark includes measures of sample efficiency, generalization, compositionality, and transfer across task rules. We systematically evaluate modern neural architectures and find that convolutional architectures surpass transformer-based architectures across all performance measures in most data regimes. However, all computational models are much less data efficient than humans, even after learning informative visual representations using self-supervision. Overall, we hope our challenge will spur interest in developing neural architectures that can learn to harness compositionality for more efficient learning.
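The abstract describes composing abstract rules and generating rule-following image sets at scale. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' released CVR generator: the scene representation, the two elementary rules, the conjunction-based composition, and the odd-one-out sampling strategy are all illustrative assumptions, and the real benchmark renders images rather than symbolic scenes.

```python
# Hypothetical sketch of rule composition and odd-one-out problem generation.
# All names (random_scene, same_size, left_to_right, compose, odd_one_out_problem)
# are illustrative assumptions, not the CVR codebase.
import random
from typing import Callable, Dict, List, Tuple

Scene = List[Dict[str, float]]   # a scene is a list of objects with attributes
Rule = Callable[[Scene], bool]   # an abstract rule is a predicate over a scene


def random_scene(n_objects: int = 3) -> Scene:
    """Sample a toy scene of objects with a position and a size."""
    return [
        {"x": random.random(), "y": random.random(), "size": random.uniform(0.05, 0.3)}
        for _ in range(n_objects)
    ]


# Two elementary rules (illustrative only).
def same_size(scene: Scene, tol: float = 0.02) -> bool:
    sizes = [o["size"] for o in scene]
    return max(sizes) - min(sizes) < tol


def left_to_right(scene: Scene) -> bool:
    xs = [o["x"] for o in scene]
    return xs == sorted(xs)


def compose(*rules: Rule) -> Rule:
    """Treat a composition as the conjunction of elementary rules."""
    return lambda scene: all(rule(scene) for rule in rules)


def sample_scene(rule: Rule, satisfy: bool, max_tries: int = 20_000) -> Scene:
    """Rejection-sample a scene that satisfies (or violates) the rule."""
    for _ in range(max_tries):
        scene = random_scene()
        if rule(scene) == satisfy:
            return scene
    raise RuntimeError("could not sample a scene; rule may be too restrictive")


def odd_one_out_problem(rule: Rule, n_panels: int = 4) -> Tuple[List[Scene], int]:
    """Build one odd-one-out item: n_panels - 1 scenes follow the rule, one violates it."""
    panels = [sample_scene(rule, satisfy=True) for _ in range(n_panels - 1)]
    odd_index = random.randrange(n_panels)
    panels.insert(odd_index, sample_scene(rule, satisfy=False))
    return panels, odd_index


if __name__ == "__main__":
    composite = compose(same_size, left_to_right)  # "same size AND ordered left to right"
    panels, odd = odd_one_out_problem(composite)
    print(f"odd panel index: {odd}")
```

Under these assumptions, new composite tasks come cheaply from reusing the same elementary rules in different conjunctions, which is the property the benchmark uses to probe compositional transfer.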