Kafle Kushal, Shrestha Robik, Kanan Christopher
Center for Imaging Science, Rochester Institute of Technology, Rochester, NY, United States.
Paige, New York, NY, United States.
Front Artif Intell. 2019 Dec 13;2:28. doi: 10.3389/frai.2019.00028. eCollection 2019.
Language-grounded image understanding tasks have often been proposed as a method for evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of capabilities that integrate computer vision, reasoning, and natural language understanding. However, the datasets and evaluation procedures used in these tasks are replete with flaws that allow vision and language (V&L) algorithms to achieve good performance without a robust understanding of either vision or language. We argue for this position based on several recent studies in the V&L literature and on our own observations of dataset bias, robustness, and spurious correlations. Finally, we propose that several of these challenges can be mitigated through the creation of carefully designed benchmarks.