
Fine-grained knowledge about manipulable objects is well-predicted by contrastive language image pre-training.

Authors

Walbrin Jon, Sossounov Nikita, Mahdiani Morteza, Vaz Igor, Almeida Jorge

Affiliations

Proaction Laboratory, Faculty of Psychology and Educational Sciences, University of Coimbra, Coimbra, Portugal.

CINEICC, Faculty of Psychology and Educational Sciences, University of Coimbra, Coimbra, Portugal.

Publication

iScience. 2024 Jun 17;27(7):110297. doi: 10.1016/j.isci.2024.110297. eCollection 2024 Jul 19.

DOI: 10.1016/j.isci.2024.110297
PMID: 39040066
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11261149/
Abstract

Object recognition is an important ability that relies on distinguishing between similar objects (e.g., deciding which utensil(s) to use at different stages of meal preparation). Recent work describes the fine-grained organization of knowledge about manipulable objects via the study of the constituent dimensions that are most relevant to human behavior, for example, vision, manipulation, and function-based properties. A logical extension of this work concerns whether or not these dimensions are uniquely human, or can be approximated by deep learning. Here, we show that behavioral dimensions are generally well-predicted by CLIP-ViT - a multimodal network trained on a large and diverse set of image-text pairs. Moreover, this model outperforms comparison networks pre-trained on smaller, image-only datasets. These results demonstrate the impressive capacity of CLIP-ViT to approximate fine-grained object knowledge. We discuss the possible sources of this benefit relative to other models (e.g., multimodal vs. image-only pre-training, dataset size, architecture).
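The analysis the abstract describes — predicting human behavioral dimensions from a network's image embeddings — can be illustrated with a minimal, self-contained sketch. This is not the authors' code: in the actual study the predictors would be CLIP-ViT image features (e.g., embeddings extracted with a library such as open_clip), and the targets would be the behaviorally derived object dimensions; here both are synthetic random data, and the ridge regression and cross-validation scheme are generic stand-ins for whatever the paper used.

```python
import numpy as np

# Synthetic stand-ins: in the study, X would hold CLIP-ViT embeddings of
# object images and y a behaviorally derived dimension (e.g., a
# manipulation-related rating per object). Dimensions are arbitrary here.
rng = np.random.default_rng(0)
n_objects, n_features = 200, 50
X = rng.normal(size=(n_objects, n_features))          # stand-in embeddings
true_w = rng.normal(size=n_features)
y = X @ true_w + rng.normal(scale=0.1, size=n_objects)  # stand-in dimension

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X'X + alpha*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def cv_correlation(X, y, k=5, alpha=1.0):
    """k-fold cross-validated Pearson r between predicted and observed scores."""
    idx = np.arange(len(y))
    preds = np.empty_like(y)
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        w = ridge_fit(X[train], y[train], alpha)
        preds[test] = X[test] @ w
    return np.corrcoef(preds, y)[0, 1]

r = cv_correlation(X, y)
print(f"cross-validated r = {r:.3f}")
```

Comparing this cross-validated r across feature sets (CLIP-ViT vs. image-only networks) is the kind of model comparison the abstract reports, though the paper's exact pipeline may differ.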


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/99d5bfb5f279/fx1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/3aa423909376/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/227157357c2b/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/9e94c11492c4/gr3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/156efe39b592/gr4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/05173ef9b74f/gr5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/7e4ba12877b7/gr6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/f9fd064bcbab/gr7.jpg

Similar Articles

1. Fine-grained knowledge about manipulable objects is well-predicted by contrastive language image pre-training.
iScience. 2024 Jun 17;27(7):110297. doi: 10.1016/j.isci.2024.110297. eCollection 2024 Jul 19.
2. CLIP-Driven Fine-Grained Text-Image Person Re-Identification.
IEEE Trans Image Process. 2023;32:6032-6046. doi: 10.1109/TIP.2023.3327924. Epub 2023 Nov 7.
3. The function of words: distinct neural correlates for words denoting differently manipulable objects.
J Cogn Neurosci. 2010 Aug;22(8):1844-51. doi: 10.1162/jocn.2009.21310.
4. Latent Space Semantic Supervision Based on Knowledge Distillation for Cross-Modal Retrieval.
IEEE Trans Image Process. 2022;31:7154-7164. doi: 10.1109/TIP.2022.3220051. Epub 2022 Nov 16.
5. Impacts of Image Obfuscation on Fine-grained Activity Recognition in Egocentric Video.
Proc IEEE Int Conf Pervasive Comput Commun Workshops. 2022 Mar;2022:341-346. doi: 10.1109/percomworkshops53856.2022.9767447. Epub 2022 May 6.
6. X-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks.
IEEE Trans Pattern Anal Mach Intell. 2024 May;46(5):3156-3168. doi: 10.1109/TPAMI.2023.3339661. Epub 2024 Apr 3.
7. Proto-Adapter: Efficient Training-Free CLIP-Adapter for Few-Shot Image Classification.
Sensors (Basel). 2024 Jun 4;24(11):3624. doi: 10.3390/s24113624.
8. Self-supervised pre-training with contrastive and masked autoencoder methods for dealing with small datasets in deep learning for medical imaging.
Sci Rep. 2023 Nov 20;13(1):20260. doi: 10.1038/s41598-023-46433-0.
9. A Foundation Language-Image Model of the Retina (FLAIR): encoding expert knowledge in text supervision.
Med Image Anal. 2025 Jan;99:103357. doi: 10.1016/j.media.2024.103357. Epub 2024 Oct 1.
10. An Optimization Method for Lightweight Rock Classification Models: Transferred Rich Fine-Grained Knowledge.
Sensors (Basel). 2024 Jun 25;24(13):4127. doi: 10.3390/s24134127.
