Bernardi Raffaella, Pezzelle Sandro
CIMeC and DISI, University of Trento, Trento, Italy.
ILLC, University of Amsterdam, Amsterdam, The Netherlands.
Lang Linguist Compass. 2021 Jun;15(6):e12417. doi: 10.1111/lnc3.12417. Epub 2021 Jun 4.
Answering a question that is grounded in an image is a crucial ability that requires understanding the question, the visual context, and their interaction at many linguistic levels: among others, semantics, syntax and pragmatics. As such, visually-grounded questions have long been of interest to theoretical linguists and cognitive scientists. Moreover, they have inspired the first attempts to computationally model natural language understanding, where pioneering systems were faced with the highly challenging task, still unsolved, of jointly dealing with syntax, semantics and inference whilst understanding a visual context. Boosted by impressive advancements in machine learning, the task of answering visually-grounded questions has experienced renewed interest in recent years, to the point of becoming a research sub-field at the intersection of computational linguistics and computer vision. In this paper, we review current approaches to the problem, which encompass the development of datasets, models and frameworks. We conduct our investigation from the perspective of theoretical linguists; we extract from pioneering computational linguistic work a list of desiderata that we use to review current computational achievements. We acknowledge that impressive progress has been made in reconciling the engineering view with the theoretical one. At the same time, we claim that further research is needed to arrive at a unified approach that jointly encompasses all the underlying linguistic problems. We conclude the paper by sharing our own desiderata for the future.