IEEE Trans Cybern. 2022 May;52(5):3669-3683. doi: 10.1109/TCYB.2020.3015084. Epub 2022 May 19.
With the rapid growth of multimedia data on the Internet, there has been a rapid rise in the demand for visual-textual cross-media retrieval between images and sentences. However, the heterogeneity of visual and textual data makes it highly challenging to measure cross-media similarity for retrieval. Although existing methods have made great progress by exploiting the strong learning ability of deep neural networks, they rely heavily on large-scale manually annotated training data, that is, either pairwise image-sentence annotations or category annotations as supervised information for visual-textual correlation learning, which are extremely labor intensive and time consuming to collect. Without any pairwise or category annotation, it is highly challenging to construct correlations between images and sentences due to their inconsistent distributions and representations. However, people naturally understand the correlation between visual and textual data at a high semantic level, and images and sentences containing the same group of semantic concepts are easily matched in the human brain. Inspired by this human cognitive process, this article proposes an unsupervised visual-textual correlation learning (UVCL) approach to construct correlations without any manual annotation. The contributions are summarized as follows: 1) unsupervised semantic-guided cross-media correlation mining is proposed to bridge the heterogeneous gap between visual and textual data. We measure the semantic matching degree between images and sentences, and generate descriptive sentences according to the concepts extracted from images to further augment the training data in an unsupervised manner. Therefore, the approach can exploit the semantic knowledge within both visual and textual data to reduce the gap between them for further correlation learning; and 2) unsupervised visual-textual fine-grained semantic alignment is proposed to learn cross-media correlations by aligning fine-grained visual local patches and textual keywords with fine-grained soft attention as well as semantic-guided hard attention, which effectively highlights the fine-grained semantic information within both images and sentences to boost visual-textual alignment. Extensive experiments on visual-textual cross-media retrieval in the unsupervised setting, without any manual annotation, are conducted on two widely used datasets, Flickr-30K and MS-COCO, and the results verify the effectiveness of the proposed UVCL approach.
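To make the fine-grained soft-attention alignment described in contribution 2 concrete, the sketch below scores one image-sentence pair by letting each textual keyword attend over visual local-patch features and comparing each keyword with its attended visual context. This is only a minimal illustration of the general technique, not the authors' UVCL implementation: the function name, the temperature value, the feature dimensions, and the random stand-in features are all assumptions for demonstration.

```python
import torch
import torch.nn.functional as F


def soft_attention_similarity(regions: torch.Tensor,
                              words: torch.Tensor,
                              temperature: float = 4.0) -> torch.Tensor:
    """Score one image-sentence pair via word-to-region soft attention.

    regions: (R, D) local patch/region features of the image.
    words:   (W, D) keyword/word features in the same embedding space.
    Returns a scalar similarity: the mean cosine similarity between each
    word and its attention-weighted visual context vector.
    """
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)

    # Word-to-region affinities, sharpened before softmax (soft attention).
    attn = F.softmax(temperature * words @ regions.t(), dim=-1)  # (W, R)

    # Attention-weighted visual context for every keyword.
    context = attn @ regions                                     # (W, D)

    # Aggregate per-word alignment scores into one pair-level similarity.
    return F.cosine_similarity(words, context, dim=-1).mean()


if __name__ == "__main__":
    # Toy usage with random features standing in for real patch/word encoders.
    torch.manual_seed(0)
    image_regions = torch.randn(36, 256)   # e.g., 36 detected local patches
    sentence_words = torch.randn(12, 256)  # e.g., 12 keyword embeddings
    print(float(soft_attention_similarity(image_regions, sentence_words)))
```

In a full system such pair similarities would feed a ranking or matching loss; the semantic-guided hard attention mentioned in the abstract would additionally restrict attention to patches and keywords sharing extracted semantic concepts, which is omitted here for brevity.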