Cui Hejie, Fang Xinyu, Zhang Zihan, Xu Ran, Kan Xuan, Liu Xin, Yu Yue, Li Manling, Song Yangqiu, Yang Carl
Emory University.
Tongji University.
Adv Neural Inf Process Syst. 2023 Dec;36:23499-23519. Epub 2024 May 30.
Images contain rich relational knowledge that can help machines understand the world. Existing methods for visual knowledge extraction often rely on a pre-defined format (e.g., subject-verb-object tuples) or vocabulary (e.g., relation types), restricting the expressiveness of the extracted knowledge. In this work, we take a first exploration toward a new paradigm of open visual knowledge extraction. To achieve this, we present OpenVik, which consists of an open relational region detector that detects regions potentially containing relational knowledge and a visual knowledge generator that generates format-free knowledge by prompting a large multi-modality model with the detected region of interest. We also explore two data enhancement techniques for diversifying the generated format-free visual knowledge. Extensive knowledge quality evaluations highlight the correctness and uniqueness of the open visual knowledge extracted by OpenVik. Moreover, integrating our extracted knowledge across various visual reasoning applications yields consistent improvements, indicating the real-world applicability of OpenVik.
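The two-stage pipeline described in the abstract (a relational region detector followed by a prompted multimodal generator) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' released code: `propose_relational_regions` is a hypothetical placeholder for OpenVik's open relational region detector, and BLIP-2 from Hugging Face transformers stands in for the large multi-modality model that OpenVik prompts per region.

```python
# Minimal sketch of an OpenVik-style pipeline (NOT the authors' code).
# Assumptions: regions come from some off-the-shelf relational region
# proposer (here the hypothetical `propose_relational_regions`), and the
# visual knowledge generator is approximated with BLIP-2, prompted once
# per cropped region of interest to produce format-free text.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def propose_relational_regions(image: Image.Image) -> list[tuple[int, int, int, int]]:
    """Hypothetical stand-in for the open relational region detector.
    Returns (left, upper, right, lower) boxes likely to contain relations;
    this sketch simply falls back to the whole image."""
    w, h = image.size
    return [(0, 0, w, h)]

def extract_open_knowledge(image_path: str) -> list[str]:
    image = Image.open(image_path).convert("RGB")
    facts = []
    for box in propose_relational_regions(image):
        region = image.crop(box)  # detected region of interest
        # Prompt the multimodal model for free-form relational knowledge,
        # with no fixed tuple format or closed relation vocabulary.
        inputs = processor(
            images=region,
            text="Describe the relationship shown in this region:",
            return_tensors="pt",
        ).to("cuda", torch.float16)
        out = model.generate(**inputs, max_new_tokens=30)
        facts.append(processor.decode(out[0], skip_special_tokens=True).strip())
    return facts
```

In the actual system, the detector would return many relation-bearing regions per image rather than the whole frame, and the generated sentences would then pass through the paper's two data enhancement techniques to diversify the extracted knowledge.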