Yu Yi, Tang Suhua, Aizawa Kiyoharu, Aizawa Akiko
IEEE Trans Neural Netw Learn Syst. 2019 Apr;30(4):1250-1258. doi: 10.1109/TNNLS.2018.2856253. Epub 2018 Aug 10.
In this work, travel destinations and business locations are taken as venues. Discovering a venue from a photograph is important for visual context-aware applications. Unfortunately, little attention has been paid to complicated real-world images such as user-generated venue photographs. Our goal is fine-grained venue discovery from heterogeneous social multimodal data. To this end, we propose a novel deep learning model, category-based deep canonical correlation analysis. Given a photograph as input, the model exploits the cross-modal correlation between the photograph and the textual descriptions of venues to perform: 1) exact venue search (find the venue where the photograph was taken) and 2) group venue search (find relevant venues that have the same category as the photograph). In this model, data from different modalities are projected into the same space via deep networks. Pairwise correlation (between data of different modalities from the same venue), used for exact venue search, and category-based correlation (between data of different modalities from different venues with the same category), used for group venue search, are jointly optimized. Because a single photograph cannot fully reflect the rich textual description of a venue, the number of photographs per venue in the training phase is increased to capture more aspects of each venue. We build a new venue-aware multimodal data set by integrating Wikipedia featured articles and Foursquare venue photographs. Experimental results on this data set confirm the feasibility of the proposed method. Moreover, evaluation on another publicly available data set confirms that the proposed method outperforms state-of-the-art methods for cross-modal retrieval between images and text.
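The sketch below illustrates, under stated assumptions, the shape of the joint objective described in the abstract: two per-modality networks project image and text features into a shared space, and a pairwise (same-venue) term is combined with a category-based (same-category, different-venue) term. This is not the authors' code; it assumes PyTorch, uses cosine similarity as a simple stand-in for the paper's DCCA correlation objective, and the feature dimensions, network sizes, and the weight `alpha` are illustrative.

```python
# A minimal sketch (not the authors' implementation) of the joint objective,
# assuming PyTorch and pre-extracted image/text feature vectors.
# Cosine similarity stands in for the DCCA correlation term for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityNet(nn.Module):
    """Projects one modality into the shared space via a small MLP."""
    def __init__(self, in_dim, shared_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, shared_dim),
        )

    def forward(self, x):
        # Unit-normalize so dot products are cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def joint_loss(img_z, txt_z, categories, alpha=0.5):
    """Pairwise term: align image i with the text of the same venue i.
    Category term: align image i with texts of other venues of its category."""
    sim = img_z @ txt_z.t()                        # B x B cosine similarities
    pairwise = -sim.diag().mean()                  # same-venue image/text pairs
    same_cat = (categories[:, None] == categories[None, :]).float()
    same_cat.fill_diagonal_(0)                     # exclude the exact pairs
    cat_term = -(sim * same_cat).sum() / same_cat.sum().clamp(min=1)
    return pairwise + alpha * cat_term

# Toy usage with random features (4096-d image, 300-d text, both assumed):
img_net, txt_net = ModalityNet(4096), ModalityNet(300)
imgs, txts = torch.randn(8, 4096), torch.randn(8, 300)
cats = torch.randint(0, 3, (8,))                   # venue category labels
loss = joint_loss(img_net(imgs), txt_net(txts), cats)
loss.backward()
```

At retrieval time, both search modes reduce to nearest-neighbor lookup in the shared space: exact venue search returns the venue whose text embedding is closest to the query photograph's embedding, while group venue search returns the top-ranked venues, which the category-based term encourages to share the query's category.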