Liang Tingting, Chu Xiaojie, Liu Yudong, Wang Yongtao, Tang Zhi, Chu Wei, Chen Jingdong, Ling Haibin
IEEE Trans Image Process. 2022 Oct 28;PP. doi: 10.1109/TIP.2022.3216771.
Top-performing object detectors depend heavily on backbone networks, and advances in backbones that explore more effective network structures bring consistent performance gains. In this paper, we propose a novel and flexible backbone framework, namely CBNet, to construct high-performance detectors using existing open-source pre-trained backbones under the pre-training fine-tuning paradigm. The CBNet architecture groups multiple identical backbones that are connected through composite connections. It integrates the high- and low-level features of these backbones and gradually expands the receptive field to perform object detection more effectively. We also propose a better training strategy with auxiliary supervision for CBNet-based detectors. CBNet generalizes well across different backbones and detector head designs. Without additional pre-training of the composite backbone, CBNet can be adapted to various backbones (CNN-based and Transformer-based) and to the head designs of most mainstream detectors (one-stage and two-stage, anchor-based and anchor-free). Experiments provide strong evidence that, compared with simply increasing the depth and width of the network, CBNet is a more efficient, effective, and resource-friendly way to build high-performance backbone networks. In particular, our CB-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev under the single-model and single-scale testing protocol, significantly better than the state-of-the-art results (57.7% box AP and 50.2% mask AP) achieved by Swin-L, while reducing the training time by 6×. With multi-scale testing, we push the current best single-model result to a new record of 60.1% box AP and 52.3% mask AP without using extra training data. Code is available at https://github.com/VDIGPKU/CBNetV2.
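The composite-backbone idea and the auxiliary supervision described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify the form of the composite connection, so the sketch assumes that the output of stage i of the preceding backbone is added to the input of stage i of the next backbone, and that the training loss is the lead backbone's loss plus a weighted sum of the assisting backbones' losses. The names `make_stage`, `composite_forward`, `total_loss`, and the weight value are hypothetical, and features are plain lists of floats rather than image tensors.

```python
from typing import Callable, List, Sequence

Feature = List[float]
Stage = Callable[[Feature], Feature]


def make_stage(scale: float) -> Stage:
    """Hypothetical stand-in for one backbone stage: scales every element."""
    return lambda x: [scale * v for v in x]


def add(a: Feature, b: Feature) -> Feature:
    """Element-wise sum of two features of equal length."""
    return [u + v for u, v in zip(a, b)]


def composite_forward(
    backbones: Sequence[Sequence[Stage]], x: Feature
) -> List[List[Feature]]:
    """Run identical backbones one after another on the same input.

    Assumed composite connection: before stage i of the current backbone
    runs, the stage-i output of the previous backbone is added to its input.
    A real detector would also need a projection or resize step here, which
    this toy omits because all features share one shape.
    Returns the stage outputs of every backbone; the last entry belongs to
    the lead backbone.
    """
    all_outputs: List[List[Feature]] = []
    prev: List[Feature] = []
    for stages in backbones:
        h = x
        outputs: List[Feature] = []
        for i, stage in enumerate(stages):
            if prev:  # every backbone except the first receives features
                h = add(h, prev[i])
            h = stage(h)
            outputs.append(h)
        all_outputs.append(outputs)
        prev = outputs
    return all_outputs


def total_loss(
    lead_loss: float, assistant_losses: Sequence[float], weight: float = 0.5
) -> float:
    """Assumed auxiliary supervision: lead loss plus weighted assistant losses.

    The weight of 0.5 is an illustrative value, not one taken from the paper.
    """
    return lead_loss + weight * sum(assistant_losses)


# Two identical two-stage backbones, each stage doubling its input.
backbone = [make_stage(2.0), make_stage(2.0)]
features = composite_forward([backbone, backbone], [1.0])
lead_features = features[-1]
```

In this toy setup the detection head would consume `lead_features`, while the earlier backbones' outputs would feed only the auxiliary loss terms during training.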