Challenges and Prospects in Vision and Language Research.

Authors

Kushal Kafle, Robik Shrestha, Christopher Kanan

Affiliations

Center for Imaging Science, Rochester Institute of Technology, Rochester, NY, United States.

Paige, New York, NY, United States.

Publication information

Front Artif Intell. 2019 Dec 13;2:28. doi: 10.3389/frai.2019.00028. eCollection 2019.

DOI:10.3389/frai.2019.00028

PMID:33733117

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7861287/

Abstract

Language-grounded image understanding tasks have often been proposed as a method for evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of capabilities that integrate computer vision, reasoning, and natural language understanding. However, the datasets and evaluation procedures used in these tasks are replete with flaws that allow vision and language (V&L) algorithms to achieve good performance without a robust understanding of vision and language. We argue for this position based on several recent studies in the V&L literature and our own observations of dataset bias, robustness, and spurious correlations. Finally, we propose that several of these challenges can be mitigated by the creation of carefully designed benchmarks.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/62d5/7861287/dc965b95e658/frai-02-00028-g0001.jpg

Similar articles

Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training.

IEEE J Biomed Health Inform. 2022 Dec;26(12):6070-6080. doi: 10.1109/JBHI.2022.3207502. Epub 2022 Dec 7.

Visual Cluster Grounding for Image Captioning.

IEEE Trans Image Process. 2022;31:3920-3934. doi: 10.1109/TIP.2022.3177318. Epub 2022 Jun 9.

Arabic Captioning for Images of Clothing Using Deep Learning.

Sensors (Basel). 2023 Apr 7;23(8):3783. doi: 10.3390/s23083783.

Vision-to-Language Tasks Based on Attributes and Attention Mechanism.

IEEE Trans Cybern. 2021 Feb;51(2):913-926. doi: 10.1109/TCYB.2019.2914351. Epub 2021 Jan 15.

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset.

IEEE Trans Pattern Anal Mach Intell. 2025 Feb;47(2):708-724. doi: 10.1109/TPAMI.2024.3479776. Epub 2025 Jan 9.

Image Captioning and Visual Question Answering Based on Attributes and External Knowledge.

IEEE Trans Pattern Anal Mach Intell. 2018 Jun;40(6):1367-1381. doi: 10.1109/TPAMI.2017.2708709. Epub 2017 May 26.

Robust Visual Question Answering: Datasets, Methods, and Future Challenges.

IEEE Trans Pattern Anal Mach Intell. 2024 Aug;46(8):5575-5594. doi: 10.1109/TPAMI.2024.3366154. Epub 2024 Jul 2.

From Show to Tell: A Survey on Deep Learning-Based Image Captioning.

IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):539-559. doi: 10.1109/TPAMI.2022.3148210. Epub 2022 Dec 5.

Deconfounded Image Captioning: A Causal Retrospect.

IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12996-13010. doi: 10.1109/TPAMI.2021.3121705. Epub 2023 Oct 3.

Cited by

Detecting Spurious Correlations With Sanity Tests for Artificial Intelligence Guided Radiology Systems.

Front Digit Health. 2021 Aug 3;3:671015. doi: 10.3389/fdgth.2021.671015. eCollection 2021.

Linguistic issues behind visual question answering.

Lang Linguist Compass. 2021 Jun;15(6):e12417. doi: 10.1111/lnc3.12417. Epub 2021 Jun 4.

Unanswerable Questions About Images and Texts.

Front Artif Intell. 2020 Jul 29;3:51. doi: 10.3389/frai.2020.00051. eCollection 2020.

Evaluating Multimedia and Language Tasks.

Front Artif Intell. 2020 May 5;3:32. doi: 10.3389/frai.2020.00032. eCollection 2020.
