Indian Institute of Technology, Bombay, India.
Neural Netw. 2024 Dec;180:106718. doi: 10.1016/j.neunet.2024.106718. Epub 2024 Sep 11.
With the rapid advent and abundance of remote sensing data across modalities, cross-modal retrieval has gained importance in the research community. In cross-modal retrieval, the query belongs to one modality and the retrieved output to another. In this paper, the remote sensing (RS) modalities considered are earth-observation optical data (aerial photographs) and the corresponding hand-drawn sketches. The central challenge of cross-modal retrieval between optical RS images and their sketches is the distribution gap in the shared embedding space of the two modalities. Prior attempts to resolve this issue have not yielded satisfactory retrieval accuracy for cross-modal sketch-image RS data. State-of-the-art architectures have relied on conventional convolutional networks, which focus on local pixel-wise information in each modality. This limits the interaction between the sketch texture and the corresponding image, making such models prone to overfitting datasets with particular scenarios. To circumvent this limitation, we establish multi-modal correspondence with SPCA-Net, a novel architecture combining self- and cross-attention, which minimizes the modality gap by applying attention mechanisms to the query and target modalities. The proposed attention architecture achieves efficient cross-modal retrieval by empirically emphasizing global information in the query modality and bridging the domain gap through a unique pairwise cross-attention network. In addition to the novel architecture, this paper introduces a loss function, a label-specific supervised contrastive loss, tailored to the intricacies of the task and designed to enhance the discriminative power of the learned embeddings. Extensive evaluations are conducted on two sketch-image remote sensing datasets, Earth-on-Canvas and RSketch. Under identical experimental conditions, the proposed model surpasses state-of-the-art architectures by significant margins of 16.7%, 18.9%, 33.7%, and 40.9%, respectively.
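The abstract names the combined self- and cross-attention design but gives no implementation details. As a rough illustration of the idea, the following PyTorch sketch applies self-attention within each modality and then pairwise cross-attention in which each modality queries the other; the module names, dimensions, and layer arrangement are assumptions for exposition, not the authors' SPCA-Net code.

```python
import torch
import torch.nn as nn

class PairwiseCrossAttentionBlock(nn.Module):
    """Illustrative sketch only: self-attention within each modality,
    followed by pairwise cross-attention where each modality queries
    the other. Hypothetical names/dims, not the authors' SPCA-Net."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_i = nn.LayerNorm(dim)

    def forward(self, sketch, image):
        # sketch, image: (batch, tokens, dim) token/patch embeddings
        # 1) self-attention captures global context within each modality
        s, _ = self.self_attn_s(sketch, sketch, sketch)
        i, _ = self.self_attn_i(image, image, image)
        s = self.norm_s(sketch + s)
        i = self.norm_i(image + i)
        # 2) pairwise cross-attention: each modality attends to the other,
        #    pulling the two embedding distributions toward a shared space
        s2, _ = self.cross_attn_s(s, i, i)   # sketch queries image tokens
        i2, _ = self.cross_attn_i(i, s, s)   # image queries sketch tokens
        return s + s2, i + i2
```

Mean-pooling the returned token sequences would be one plausible way to obtain the modality embeddings used for retrieval.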
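Likewise, the label-specific supervised contrastive loss is only named in the abstract, not specified. The sketch below implements a plain supervised contrastive loss in the style of Khosla et al. (2020) as a baseline for what the authors' variant may refine; the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    # embeddings: (N, D) features pooled from both modalities;
    # labels: (N,) class indices shared by matching sketch-image pairs.
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                      # cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))    # drop self-pairs
    # positives share a label with the anchor (excluding the anchor itself)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # zero out non-positive entries before summing (avoids 0 * -inf = nan)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return -(pos_log_prob.sum(dim=1) / pos_count).mean()
```

In training, sketch and image embeddings from the shared space could be concatenated along the batch dimension with their class labels, so that cross-modal pairs of the same class act as positives for one another.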