
Visual enumeration remains challenging for multimodal generative AI.

Authors

Testolin Alberto, Hou Kuinan, Zorzi Marco

Affiliations

Department of General Psychology and Department of Mathematics, University of Padova, Padova, Italy.

Department of General Psychology, University of Padova, Padova, Italy.

Publication

PLoS One. 2025 Sep 12;20(9):e0331566. doi: 10.1371/journal.pone.0331566. eCollection 2025.

Abstract

Many animal species can approximately judge the number of objects in a visual scene at a single glance, and humans can further determine the exact cardinality of a set by deploying systematic counting procedures. In contrast, it has been observed that even state-of-the-art AI systems have very limited enumeration skills. In this work, we propose two benchmark tasks inspired by cognitive science that allow us to precisely evaluate the visual enumeration capabilities of multimodal foundation models, thereby providing an objective measure of their number sense and counting level. We consider popular visual question answering models (BLIP, LLaVA and ViLT) as well as advanced image-to-text (Gemini, GPT and Qwen) and text-to-image (DALL-E, FLUX and Stable Diffusion) AI systems. Our analyses show that even the most advanced models cannot reliably name the number of objects in simple visual stimuli or generate images containing a target number of items, as indexed by their low accuracy in both types of tasks. Especially for numbers outside the subitizing range, their responses are often far from the target numerosity, and, in stark contrast with human behavior, in many cases the distribution of errors depends on the object category. We also observe some striking mistakes with small numbers. Our findings demonstrate that developing an intuitive visual understanding of number remains challenging for AI models and that merely increasing model size might not be a viable strategy to promote the emergence of systematic counting skills. We release the full code of our benchmark to facilitate the evaluation of enumeration skills in future AI systems.
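The abstract indexes model performance by exact-match accuracy and by how far responses fall from the target numerosity, broken down by object category. The minimal sketch below illustrates that scoring scheme; the `score_enumeration` function, the trial tuples, and the example data are all hypothetical, not taken from the paper's released benchmark code.

```python
from collections import defaultdict

def score_enumeration(trials):
    """Score enumeration trials given as (category, target, response) tuples.

    Returns overall exact-match accuracy and the mean absolute error
    per object category, the two measures used to index counting skill.
    """
    correct = 0
    errors = defaultdict(list)
    for category, target, response in trials:
        if response == target:
            correct += 1
        # Distance from the target numerosity, grouped by category,
        # so category-dependent error distributions become visible.
        errors[category].append(abs(response - target))
    accuracy = correct / len(trials)
    mae = {cat: sum(e) / len(e) for cat, e in errors.items()}
    return accuracy, mae

# Hypothetical trials: small sets (subitizing range) vs. larger sets.
trials = [
    ("apples", 3, 3),    # small set named correctly
    ("apples", 8, 6),    # larger set: response off by 2
    ("dots", 8, 8),
    ("dots", 12, 15),    # response far from the target numerosity
]
accuracy, mae = score_enumeration(trials)
# accuracy -> 0.5; mae -> {"apples": 1.0, "dots": 1.5}
```

Comparing `mae` across categories is one simple way to check the paper's observation that, unlike in human behavior, error distributions can depend on the object category.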


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00e3/12431670/d9d7d738555f/pone.0331566.g001.jpg
