IEEE Trans Image Process. 2022;31:4842-4855. doi: 10.1109/TIP.2022.3187565. Epub 2022 Jul 20.
Extracting robust and discriminative local features from images plays a vital role in long-term visual localization, whose main challenges arise from severe appearance differences between matching images caused by day-night illumination, seasonal changes, and human activities. Existing solutions resort to jointly learning keypoints and their descriptors in an end-to-end manner, leveraging large numbers of point-correspondence annotations harvested from structure-from-motion and depth estimation algorithms. While these methods outperform non-deep methods and two-stage deep methods (i.e., detect-then-describe), they still struggle to overcome the problems encountered in long-term visual localization. Since intrinsic semantics are invariant to local appearance changes, this paper proposes to learn semantic-aware local features in order to improve the robustness of local feature matching for long-term localization. Built on a state-of-the-art CNN architecture for local feature learning, i.e., ASLFeat, this paper leverages semantic information from an off-the-shelf semantic segmentation network to learn semantic-aware feature maps. The learned correspondence-aware feature descriptors and semantic features are then merged to form the final feature descriptors, whose improved feature matching ability is observed in experiments. In addition, the semantics embedded in the learned features can further be used to filter out noisy keypoints, leading to additional accuracy gains and faster matching. Experiments on two popular long-term visual localization benchmarks (Aachen Day-Night v1.1, RobotCar Seasons) and one challenging indoor benchmark (InLoc) demonstrate encouraging improvements in localization accuracy over the baseline and other competitive methods.
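As a rough illustration of the two ingredients the abstract describes, merging correspondence-aware descriptors with semantic features and semantics-based keypoint filtering, the following PyTorch sketch shows one plausible reading. The function names (`fuse_descriptors`, `filter_keypoints`), the concatenation-based merge, and the `DYNAMIC_CLASSES` label set are all assumptions for illustration; the abstract does not specify the exact fusion operator or which semantic classes are treated as noisy.

```python
import torch
import torch.nn.functional as F

# Assumed set of "unstable" semantic class ids (e.g., person, rider, car);
# the paper's actual filtered classes are not stated in the abstract.
DYNAMIC_CLASSES = torch.tensor([11, 12, 13])

def fuse_descriptors(desc_map: torch.Tensor, sem_map: torch.Tensor) -> torch.Tensor:
    """Merge dense local descriptors with semantic features.

    desc_map: (B, Cd, H, W) correspondence-aware descriptors (ASLFeat-like).
    sem_map:  (B, Cs, h, w) features from an off-the-shelf segmentation net.
    Concatenation is one assumed instance of the abstract's "merged".
    """
    # Resize semantic features to the descriptor map's spatial resolution.
    sem_map = F.interpolate(sem_map, size=desc_map.shape[-2:],
                            mode="bilinear", align_corners=False)
    # L2-normalize each cue so neither dominates the merged descriptor.
    desc_map = F.normalize(desc_map, dim=1)
    sem_map = F.normalize(sem_map, dim=1)
    fused = torch.cat([desc_map, sem_map], dim=1)
    return F.normalize(fused, dim=1)  # unit-length final descriptors

def filter_keypoints(kpts: torch.Tensor, sem_labels: torch.Tensor) -> torch.Tensor:
    """Drop keypoints that fall on unstable semantic classes.

    kpts:       (N, 2) integer (x, y) keypoint locations.
    sem_labels: (H, W) per-pixel semantic class map.
    """
    labels = sem_labels[kpts[:, 1], kpts[:, 0]]           # class at each keypoint
    keep = ~torch.isin(labels, DYNAMIC_CLASSES)           # reject noisy classes
    return kpts[keep]
```

Concatenation keeps the two cues decoupled in the final descriptor; a learned projection or weighted sum would be an equally plausible interpretation of the merge step. The filtering pass both removes match outliers on dynamic objects and shrinks the keypoint set, which is consistent with the reported faster matching.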