Walbrin Jon, Sossounov Nikita, Mahdiani Morteza, Vaz Igor, Almeida Jorge
Proaction Laboratory, Faculty of Psychology and Educational Sciences, University of Coimbra, Coimbra, Portugal.
CINEICC, Faculty of Psychology and Educational Sciences, University of Coimbra, Coimbra, Portugal.
iScience. 2024 Jun 17;27(7):110297. doi: 10.1016/j.isci.2024.110297. eCollection 2024 Jul 19.
Object recognition is an important ability that relies on distinguishing between similar objects (e.g., deciding which utensil(s) to use at different stages of meal preparation). Recent work describes the fine-grained organization of knowledge about manipulable objects by studying the constituent dimensions that are most relevant to human behavior, for example, vision-, manipulation-, and function-based properties. A logical extension of this work concerns whether these dimensions are uniquely human or can be approximated by deep learning. Here, we show that behavioral dimensions are generally well predicted by CLIP-ViT, a multimodal network trained on a large and diverse set of image-text pairs. Moreover, this model outperforms comparison networks pre-trained on smaller, image-only datasets. These results demonstrate the impressive capacity of CLIP-ViT to approximate fine-grained object knowledge. We discuss the possible sources of this benefit relative to other models (e.g., multimodal vs. image-only pre-training, dataset size, architecture).
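As a rough illustration of the kind of analysis described in the abstract, the sketch below shows one common way to test whether network embeddings predict behavioral dimension scores: extract CLIP-ViT image features and fit a cross-validated ridge regression per dimension. This is not the authors' actual pipeline; the model variant, file names, regression method, and cross-validation scheme are assumptions for illustration only.

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

# Pre-trained CLIP-ViT image encoder (variant chosen for illustration).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def embed_images(image_paths):
    """Return one CLIP-ViT embedding per object image."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            image = Image.open(path).convert("RGB")
            inputs = processor(images=image, return_tensors="pt")
            feats.append(model.get_image_features(**inputs).squeeze(0).numpy())
    return np.stack(feats)

# Hypothetical inputs: one image per manipulable object, plus a matrix of
# human-derived dimension scores with shape (n_objects, n_dimensions).
image_paths = ["object_01.jpg", "object_02.jpg"]   # placeholder paths
behavioral_dims = np.load("behavioral_dims.npy")   # placeholder file

X = embed_images(image_paths)

# Cross-validated ridge regression: predict each behavioral dimension from
# the image embeddings, then score accuracy as the correlation between
# held-out predictions and the human-derived scores.
for d in range(behavioral_dims.shape[1]):
    y = behavioral_dims[:, d]
    pred = cross_val_predict(RidgeCV(alphas=np.logspace(-3, 3, 13)), X, y, cv=10)
    print(f"dimension {d}: cross-validated r = {np.corrcoef(pred, y)[0, 1]:.3f}")

Comparison networks pre-trained on image-only datasets would be evaluated the same way, swapping in their penultimate-layer features for the CLIP-ViT embeddings.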