Duan Kaiwen, Bai Song, Xie Lingxi, Qi Honggang, Huang Qingming, Tian Qi
IEEE Trans Pattern Anal Mach Intell. 2024 May;46(5):3509-3521. doi: 10.1109/TPAMI.2023.3342120. Epub 2024 Apr 3.
There are two mainstream approaches to object detection: top-down and bottom-up. The state-of-the-art approaches are mainly top-down methods. In this paper, we demonstrate that bottom-up approaches perform competitively with top-down approaches and have higher recall rates. Our approach, named CenterNet, detects each object as a triplet of keypoints (the top-left corner, the bottom-right corner, and the center keypoint). We first group the corners according to some designed cues and then confirm the object locations based on the center keypoints. The corner keypoints allow the approach to detect objects of various scales and shapes, and the center keypoint reduces the confusion introduced by a large number of false-positive proposals. Our approach is an anchor-free detector because it does not need to define explicit anchor boxes. We adapt our approach to backbones with different structures, including 'hourglass'-like networks and 'pyramid'-like networks, which detect objects in single-resolution and multi-resolution feature maps, respectively. On the MS-COCO dataset, CenterNet with Res2Net-101 and Swin-Transformer backbones achieves average precisions (APs) of 53.7% and 57.1%, respectively, outperforming all existing bottom-up detectors and achieving state-of-the-art performance. We also design a real-time CenterNet model, which achieves a good trade-off between accuracy and speed, with an AP of 43.6% at 30.5 frames per second (FPS).
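The triplet idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the grouping cues, so this sketch pairs corners by class label and simple geometry as a stand-in, and it assumes that a proposal is confirmed when a same-class center keypoint falls inside the middle 1/n portion of the box. The names `Keypoint`, `central_region`, and `detect_triplets`, the parameter `n`, and the averaged score are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keypoint:
    x: float
    y: float
    score: float
    label: int


def central_region(tl, br, n=3):
    """Return (x0, y0, x1, y1) of the middle 1/n portion of the box (assumed rule)."""
    margin_x = (br.x - tl.x) * (n - 1) / (2 * n)
    margin_y = (br.y - tl.y) * (n - 1) / (2 * n)
    return tl.x + margin_x, tl.y + margin_y, br.x - margin_x, br.y - margin_y


def detect_triplets(top_lefts, bottom_rights, centers, n=3):
    """Pair corners into proposals, then keep those confirmed by a center keypoint."""
    boxes = []
    for tl in top_lefts:
        for br in bottom_rights:
            # Stand-in grouping cue (assumption): same class, valid geometry.
            if tl.label != br.label or br.x <= tl.x or br.y <= tl.y:
                continue
            x0, y0, x1, y1 = central_region(tl, br, n)
            matches = [
                c for c in centers
                if c.label == tl.label and x0 <= c.x <= x1 and y0 <= c.y <= y1
            ]
            if not matches:
                continue  # no center keypoint: treat the proposal as a false positive
            best = max(matches, key=lambda c: c.score)
            # Hypothetical scoring: average the three keypoint scores.
            score = (tl.score + br.score + best.score) / 3
            boxes.append((tl.x, tl.y, br.x, br.y, score, tl.label))
    return boxes


if __name__ == "__main__":
    tls = [Keypoint(0, 0, 0.9, 1)]
    brs = [Keypoint(90, 60, 0.8, 1), Keypoint(30, 30, 0.6, 1)]
    ctrs = [Keypoint(45, 30, 0.7, 1)]
    for box in detect_triplets(tls, brs, ctrs):
        print(box)
```

In this toy input the single top-left corner forms two proposals, and only the one whose central region contains the center keypoint survives, which mirrors the false-positive filtering role the abstract attributes to the center keypoint.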