
Fine-grained knowledge about manipulable objects is well-predicted by contrastive language image pre-training.

Authors

Walbrin Jon, Sossounov Nikita, Mahdiani Morteza, Vaz Igor, Almeida Jorge

Affiliations

Proaction Laboratory, Faculty of Psychology and Educational Sciences, University of Coimbra, Coimbra, Portugal.

CINEICC, Faculty of Psychology and Educational Sciences, University of Coimbra, Coimbra, Portugal.

Publication

iScience. 2024 Jun 17;27(7):110297. doi: 10.1016/j.isci.2024.110297. eCollection 2024 Jul 19.

DOI: 10.1016/j.isci.2024.110297
PMID: 39040066
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11261149/
Abstract

Object recognition is an important ability that relies on distinguishing between similar objects (e.g., deciding which utensil(s) to use at different stages of meal preparation). Recent work describes the fine-grained organization of knowledge about manipulable objects via the study of the constituent dimensions that are most relevant to human behavior, for example, vision, manipulation, and function-based properties. A logical extension of this work concerns whether or not these dimensions are uniquely human, or can be approximated by deep learning. Here, we show that behavioral dimensions are generally well-predicted by CLIP-ViT - a multimodal network trained on a large and diverse set of image-text pairs. Moreover, this model outperforms comparison networks pre-trained on smaller, image-only datasets. These results demonstrate the impressive capacity of CLIP-ViT to approximate fine-grained object knowledge. We discuss the possible sources of this benefit relative to other models (e.g., multimodal vs. image-only pre-training, dataset size, architecture).
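The analysis the abstract describes — predicting human behavioral dimensions from a network's image embeddings — can be illustrated with a minimal, self-contained sketch. This is not the authors' code: in the actual study the predictors would be CLIP-ViT image features (e.g., embeddings extracted with a library such as open_clip), and the targets would be the behaviorally derived object dimensions; here both are synthetic random data, and the ridge regression and cross-validation scheme are generic stand-ins for whatever the paper used.

```python
import numpy as np

# Synthetic stand-ins: in the study, X would hold CLIP-ViT embeddings of
# object images and y a behaviorally derived dimension (e.g., a
# manipulation-related rating per object). Dimensions are arbitrary here.
rng = np.random.default_rng(0)
n_objects, n_features = 200, 50
X = rng.normal(size=(n_objects, n_features))          # stand-in embeddings
true_w = rng.normal(size=n_features)
y = X @ true_w + rng.normal(scale=0.1, size=n_objects)  # stand-in dimension

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X'X + alpha*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def cv_correlation(X, y, k=5, alpha=1.0):
    """k-fold cross-validated Pearson r between predicted and observed scores."""
    idx = np.arange(len(y))
    preds = np.empty_like(y)
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        w = ridge_fit(X[train], y[train], alpha)
        preds[test] = X[test] @ w
    return np.corrcoef(preds, y)[0, 1]

r = cv_correlation(X, y)
print(f"cross-validated r = {r:.3f}")
```

Comparing this cross-validated r across feature sets (CLIP-ViT vs. image-only networks) is the kind of model comparison the abstract reports, though the paper's exact pipeline may differ.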


Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/99d5bfb5f279/fx1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/3aa423909376/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/227157357c2b/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/9e94c11492c4/gr3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/156efe39b592/gr4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/05173ef9b74f/gr5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/7e4ba12877b7/gr6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e976/11261149/f9fd064bcbab/gr7.jpg

Similar Articles

1. Fine-grained knowledge about manipulable objects is well-predicted by contrastive language image pre-training.
iScience. 2024 Jun 17;27(7):110297. doi: 10.1016/j.isci.2024.110297. eCollection 2024 Jul 19.
2. CLIP-Driven Fine-Grained Text-Image Person Re-Identification.
IEEE Trans Image Process. 2023;32:6032-6046. doi: 10.1109/TIP.2023.3327924. Epub 2023 Nov 7.
3. The function of words: distinct neural correlates for words denoting differently manipulable objects.
J Cogn Neurosci. 2010 Aug;22(8):1844-51. doi: 10.1162/jocn.2009.21310.
4. Latent Space Semantic Supervision Based on Knowledge Distillation for Cross-Modal Retrieval.
IEEE Trans Image Process. 2022;31:7154-7164. doi: 10.1109/TIP.2022.3220051. Epub 2022 Nov 16.
5. Impacts of Image Obfuscation on Fine-grained Activity Recognition in Egocentric Video.
Proc IEEE Int Conf Pervasive Comput Commun Workshops. 2022 Mar;2022:341-346. doi: 10.1109/percomworkshops53856.2022.9767447. Epub 2022 May 6.
6. X-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks.
IEEE Trans Pattern Anal Mach Intell. 2024 May;46(5):3156-3168. doi: 10.1109/TPAMI.2023.3339661. Epub 2024 Apr 3.
7. Proto-Adapter: Efficient Training-Free CLIP-Adapter for Few-Shot Image Classification.
Sensors (Basel). 2024 Jun 4;24(11):3624. doi: 10.3390/s24113624.
8. Self-supervised pre-training with contrastive and masked autoencoder methods for dealing with small datasets in deep learning for medical imaging.
Sci Rep. 2023 Nov 20;13(1):20260. doi: 10.1038/s41598-023-46433-0.
9. A Foundation Language-Image Model of the Retina (FLAIR): encoding expert knowledge in text supervision.
Med Image Anal. 2025 Jan;99:103357. doi: 10.1016/j.media.2024.103357. Epub 2024 Oct 1.
10. An Optimization Method for Lightweight Rock Classification Models: Transferred Rich Fine-Grained Knowledge.
Sensors (Basel). 2024 Jun 25;24(13):4127. doi: 10.3390/s24134127.
