Deformable Part Region Learning and Feature Aggregation Tree Representation for Object Detection

Author information

Bae Seung-Hwan

Publication information

IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10817-10834. doi: 10.1109/TPAMI.2023.3268864. Epub 2023 Aug 7.

DOI:10.1109/TPAMI.2023.3268864

Abstract

Region-based object detection infers object regions for one or more categories in an image. Owing to recent advances in deep learning and region proposal methods, object detectors based on convolutional neural networks (CNNs) have flourished and produced promising detection results. However, the accuracy of convolutional object detectors is often degraded by low feature discriminability caused by the geometric variation or transformation of an object. In this article, we propose deformable part region (DPR) learning, which allows decomposed part regions to deform according to the geometric transformation of an object. Because ground truth for the part models is unavailable in many cases, we design part model losses for detection and segmentation, and learn the geometric parameters by minimizing an integral loss that includes those part losses. As a result, we can train our DPR network without extra supervision and make multi-part models deformable according to object geometric variation. Moreover, we propose a novel feature aggregation tree (FAT) to learn more discriminative region of interest (RoI) features via bottom-up tree construction. The FAT learns stronger semantic features by aggregating part RoI features along the bottom-up pathways of the tree. We also present a spatial and channel attention mechanism for aggregating features of different nodes. Based on the proposed DPR and FAT networks, we design a new cascade architecture that refines detection tasks iteratively. Without bells and whistles, we achieve impressive detection and segmentation results on the MSCOCO and PASCAL VOC datasets. Our Cascade D-PRD achieves 57.9 box AP with the Swin-L backbone. We also provide an extensive ablation study to demonstrate the effectiveness and usefulness of the proposed methods for large-scale object detection.
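The bottom-up aggregation of part RoI features described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's FAT: the abstract does not specify the tree layout, the attention layers, or the fusion operator, so everything below is an assumption made for illustration. The function names (`channel_attention`, `spatial_attention`, `aggregate`, `build_tree`) are hypothetical, the attention gates are parameter-free sigmoid gates rather than learned layers, the fusion is an element-wise maximum, and the tree is a simple pairwise (binary) tree over four part features.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat):
    # feat has shape (C, H, W). One gate per channel, computed from
    # global average pooling (assumed, parameter-free stand-in for a learned layer).
    gate = sigmoid(feat.mean(axis=(1, 2)))
    return feat * gate[:, None, None]


def spatial_attention(feat):
    # One gate per spatial location, computed from the channel-wise mean
    # (assumed, parameter-free stand-in for a learned layer).
    gate = sigmoid(feat.mean(axis=0))
    return feat * gate[None, :, :]


def aggregate(left, right):
    # Fuse two child node features into one parent node feature.
    # Element-wise max is an assumed fusion operator.
    a = spatial_attention(channel_attention(left))
    b = spatial_attention(channel_attention(right))
    return np.maximum(a, b)


def build_tree(leaves):
    # Aggregate leaf features pairwise, level by level, until one root remains.
    level = list(leaves)
    while len(level) > 1:
        nxt = [aggregate(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired node moves up unchanged
        level = nxt
    return level[0]


# Four hypothetical part RoI features, each with 8 channels on a 7x7 grid.
rng = np.random.default_rng(0)
parts = [rng.standard_normal((8, 7, 7)) for _ in range(4)]
root = build_tree(parts)
print(root.shape)  # (8, 7, 7)
```

In this sketch the root keeps the shape of a single part feature, so it could be passed to a detection head in place of a plain RoI feature; how the actual method consumes the aggregated feature is described in the paper, not here.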
