IEEE Trans Image Process. 2021;30:2155-2167. doi: 10.1109/TIP.2021.3049948. Epub 2021 Jan 26.
Cross-domain pedestrian detection, which has attracted much attention, assumes that the training and test images are drawn from different data distributions. Existing methods focus on aligning the descriptions of whole candidate instances between the source and target domains. Because the candidate instances exhibit large visual differences, aligning them as a whole cannot overcome this inter-instance variation. We argue that aligning each type of instance separately is more reasonable than aligning all candidate instances jointly. We therefore propose a novel Selective Alignment Network for cross-domain pedestrian detection, which consists of three components: a Base Detector, an Image-Level Adaptation Network, and an Instance-Level Adaptation Network. The Image-Level and Instance-Level Adaptation Networks can be regarded as global-level and local-level alignments, respectively. Similar to Faster R-CNN, the Base Detector, composed of a Feature module, an RPN module, and a Detection module, is used to learn a robust pedestrian detector from the annotated source data. Given the image description extracted by the Feature module, the Image-Level Adaptation Network aligns it across domains with an adversarial domain classifier. Given the candidate proposals generated by the RPN module, the Instance-Level Adaptation Network first clusters the source candidate proposals into several groups according to their visual features and thereby generates a pseudo label for each candidate proposal. With these pseudo labels, we align the source and target domains by iteratively maximizing and minimizing the discrepancy between the predictions of two classifiers. Extensive evaluations on several benchmarks demonstrate the effectiveness of the proposed approach for cross-domain pedestrian detection.
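Adversarial image-level alignment of the kind the abstract describes is commonly built on a gradient reversal layer: the forward pass is the identity, while the backward pass flips the sign of the domain classifier's gradient so the feature extractor learns domain-indistinguishable image descriptions. The abstract gives no code, so the following numpy sketch is illustrative only; the function names and the `lam` scaling factor are assumptions, not taken from the paper.

```python
import numpy as np

def grl_forward(features):
    """Gradient reversal layer: identity in the forward pass."""
    return features

def grl_backward(grad_from_domain_classifier, lam=1.0):
    """Backward pass: negate the gradient, scaled by lam (assumed hyperparameter).

    The domain classifier is trained to tell source from target images,
    while the reversed gradient pushes the Feature module to produce
    image descriptions the classifier cannot separate.
    """
    return -lam * grad_from_domain_classifier

# Toy check: the forward pass leaves features unchanged,
# and the backward pass reverses the gradient direction.
feat = np.array([0.5, -1.0, 2.0])
grad = np.array([0.1, 0.2, -0.3])
out = grl_forward(feat)
rev = grl_backward(grad, lam=1.0)
```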
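The instance-level step first groups the source candidate proposals by their visual features and uses each proposal's cluster index as its pseudo label. A minimal k-means sketch of that idea is below (pure numpy, with deterministic farthest-point initialization); the feature dimension, number of groups, and all names are hypothetical choices for illustration, not the paper's configuration.

```python
import numpy as np

def cluster_pseudo_labels(proposal_feats, n_groups=3, n_iters=20):
    """Cluster source proposals and return cluster indices as pseudo labels.

    proposal_feats: (N, D) array of visual features of candidate proposals.
    Returns an (N,) integer array of pseudo labels in [0, n_groups).
    """
    # Farthest-point initialization: start from the first proposal, then
    # repeatedly add the proposal farthest from all chosen centroids.
    centroids = [proposal_feats[0]]
    for _ in range(n_groups - 1):
        d = np.min(
            [np.linalg.norm(proposal_feats - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(proposal_feats[np.argmax(d)])
    centroids = np.stack(centroids)

    for _ in range(n_iters):
        # Assign each proposal to its nearest centroid.
        dists = np.linalg.norm(
            proposal_feats[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        for k in range(n_groups):
            if np.any(labels == k):
                centroids[k] = proposal_feats[labels == k].mean(axis=0)
    return labels

# Toy data: three well-separated blobs standing in for proposal features.
rng = np.random.default_rng(1)
feats = np.concatenate(
    [rng.normal(c, 0.1, size=(10, 4)) for c in (0.0, 5.0, 10.0)]
)
labels = cluster_pseudo_labels(feats, n_groups=3)
```

With well-separated blobs, every proposal in a blob receives the same pseudo label, which is exactly the property the instance-level alignment relies on: instances of the same "type" are aligned together rather than mixed with dissimilar ones.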
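The final step iteratively maximizes and minimizes the discrepancy between two classifiers' predictions, in the spirit of maximum-classifier-discrepancy training. A common discrepancy measure is the mean absolute difference between the two softmax outputs; the sketch below assumes that measure, since the abstract does not specify the exact loss.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over per-class logits, numerically stabilized."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def classifier_discrepancy(logits_f1, logits_f2):
    """Mean L1 distance between the two classifiers' predicted distributions.

    In the iterative scheme, the two classifiers are first trained to
    MAXIMIZE this value on target proposals (exposing ambiguous instances
    near the decision boundary), then the feature extractor is trained to
    MINIMIZE it, pulling target features toward the source clusters.
    """
    p1, p2 = softmax(logits_f1), softmax(logits_f2)
    return np.mean(np.abs(p1 - p2))

# Two classifiers over 3 pseudo-label classes for 2 proposals (toy values).
logits = np.array([[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]])
shifted = logits + np.array([[0.5, -0.5, 0.0], [0.0, 0.0, 0.0]])
d_same = classifier_discrepancy(logits, logits)
d_diff = classifier_discrepancy(logits, shifted)
```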