Huang Gaoshuang, Zhou Yang, Hu Xiaofei, Zhang Chenglong, Zhao Luying, Gan Wenjian
Institute of Geospatial Information, PLA Strategic Support Force Information Engineering University, Zhengzhou, 450001, China.
Sci Rep. 2024 Sep 27;14(1):22100. doi: 10.1038/s41598-024-73853-3.
Determining the geographical location of publicly available images with visual place recognition (VPR) is a pressing problem. Although most current VPR methods achieve favorable results under ideal conditions, their performance in complex environments, characterized by lighting variations, seasonal changes, and occlusions, is generally unsatisfactory. Obtaining efficient and robust image feature descriptors in such environments is therefore essential. In this study, we used the DINOv2 model as the backbone, trimming and fine-tuning it to extract robust image features, and employed a feature mix module to aggregate those features into robust, generalizable global descriptors that enable high-precision VPR. Our experiments show that the proposed DINO-Mix outperforms current state-of-the-art (SOTA) methods. On the Tokyo24/7, Nordland, and SF-XL-Testv1 test sets, which feature lighting variations, seasonal changes, and occlusions, the proposed architecture achieved Top-1 accuracy rates of 91.75%, 80.18%, and 82%, respectively, and exhibited an average accuracy improvement of 5.14%. In representative image retrieval case studies, DINO-Mix also outperformed other SOTA methods in VPR performance. Furthermore, we visualized the attention maps of DINO-Mix and other methods to provide a more intuitive understanding of their respective strengths. These visualizations further support the superiority of the DINO-Mix framework in this domain.
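To make the Top-1 accuracy figures in the abstract concrete, the following is a minimal sketch of how retrieval accuracy is commonly scored in VPR once each image has been reduced to a global descriptor. It is not the authors' evaluation code. The function names, the toy two-dimensional descriptors, and the `positives` ground-truth sets are all hypothetical, and the sketch assumes that a query counts as correct when any of its top-k retrieved database images is a labeled match. The abstract does not specify the exact protocol or distance threshold used for each dataset.

```python
import math


def l2_normalize(vec):
    """Scale a descriptor to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_accuracy(query_descs, db_descs, positives, k=1):
    """Fraction of queries with at least one labeled match among the k nearest database images.

    positives[i] is the (assumed given) set of database indices that depict
    the same place as query i.
    """
    db = [l2_normalize(d) for d in db_descs]
    hits = 0
    for qi, q in enumerate(query_descs):
        qn = l2_normalize(q)
        # Cosine similarity between the query and every database descriptor.
        sims = [sum(a * b for a, b in zip(qn, d)) for d in db]
        # Database indices ordered from most to least similar.
        ranked = sorted(range(len(db)), key=lambda i: sims[i], reverse=True)
        if any(i in positives[qi] for i in ranked[:k]):
            hits += 1
    return hits / len(query_descs)


# Toy data: three database descriptors and two queries (hypothetical values).
db_descs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_descs = [[0.9, 0.1], [0.1, 0.8]]
positives = [{0}, {2}]

top1 = top_k_accuracy(query_descs, db_descs, positives, k=1)
top2 = top_k_accuracy(query_descs, db_descs, positives, k=2)
print(f"Top-1: {top1:.2f}, Top-2: {top2:.2f}")
```

In this toy data the second query's nearest neighbor is not its labeled match, so Top-1 comes out at 0.5 while Top-2 reaches 1.0. In a real pipeline the descriptors would come from the backbone and aggregation module, and the brute-force ranking would usually be replaced by an approximate nearest-neighbor index.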