School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China.
Sensors (Basel). 2023 Apr 30;23(9):4425. doi: 10.3390/s23094425.
Hybrid models that combine convolution and transformer modules achieve impressive performance on human pose estimation. However, existing hybrid models for human pose estimation, which typically stack self-attention modules after convolution, are prone to mutual conflict. This conflict forces one type of module to dominate in these sequential hybrid models. Consequently, performance on higher-precision keypoint localization is not consistent with overall performance. To alleviate this mutual conflict, we developed a hybrid parallel network that places the self-attention modules and the convolution modules in parallel, which helps leverage their complementary capabilities effectively. In the parallel network, the self-attention branch tends to model long-range dependencies to enhance the semantic representation, while the local sensitivity of the convolution branch simultaneously contributes to high-precision localization. To further mitigate the conflict, we proposed a cross-branches attention module that gates the features generated by both branches along the channel dimension. The hybrid parallel network achieves 75.6% and 75.4% AP on the COCO validation and test-dev sets, respectively, and performs consistently on both higher-precision localization and overall performance. The experiments show that our hybrid parallel network is on par with state-of-the-art human pose estimation models.
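The parallel design described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify layer types or sizes, so the depthwise 3x3 convolution, the single-head attention, the two-layer gating MLP, the softmax over the two branches, and every function and parameter name below are assumptions chosen for illustration only.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def conv_branch(x, kernel):
    """Local branch (assumed form): depthwise 3x3 convolution, zero padding.

    x is (C, H, W); kernel is (C, 3, 3).
    """
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros(x.shape, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def attention_branch(x, wq, wk, wv):
    """Global branch (assumed form): single-head self-attention over all positions."""
    c, h, w = x.shape
    tokens = x.reshape(c, h * w).T                     # (N, C), one token per pixel
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    weights = softmax(q @ k.T / np.sqrt(c), axis=-1)   # (N, N) attention map
    return (weights @ v).T.reshape(c, h, w)


def cross_branch_gate(f_conv, f_attn, w1, w2):
    """Hypothetical cross-branch gate: per-channel weights shared out between branches."""
    c = f_conv.shape[0]
    # Global average pooling summarizes each channel of each branch.
    pooled = np.concatenate([f_conv.mean(axis=(1, 2)), f_attn.mean(axis=(1, 2))])
    hidden = np.maximum(w1 @ pooled, 0.0)              # small ReLU bottleneck
    gates = softmax((w2 @ hidden).reshape(2, c), axis=0)
    fused = gates[0][:, None, None] * f_conv + gates[1][:, None, None] * f_attn
    return fused, gates


def init_params(c, hidden, seed=0):
    """Random illustrative weights; a real model would learn these."""
    rng = np.random.default_rng(seed)
    return {
        "kernel": rng.normal(0.0, 0.1, (c, 3, 3)),
        "wq": rng.normal(0.0, 0.1, (c, c)),
        "wk": rng.normal(0.0, 0.1, (c, c)),
        "wv": rng.normal(0.0, 0.1, (c, c)),
        "w1": rng.normal(0.0, 0.1, (hidden, 2 * c)),
        "w2": rng.normal(0.0, 0.1, (2 * c, hidden)),
    }


def hybrid_parallel_block(x, params):
    """Run both branches on the same input, then fuse them with the channel gate."""
    f_conv = conv_branch(x, params["kernel"])
    f_attn = attention_branch(x, params["wq"], params["wk"], params["wv"])
    return cross_branch_gate(f_conv, f_attn, params["w1"], params["w2"])


x = np.random.default_rng(1).normal(size=(8, 16, 12))  # (channels, height, width)
params = init_params(c=8, hidden=4)
fused, gates = hybrid_parallel_block(x, params)
print(fused.shape, gates.shape)
```

Both branches receive the same input rather than one feeding the other, which is the structural difference from the sequential hybrids the abstract criticizes. The gate in this sketch assigns each channel a pair of weights that sum to one across the two branches, so neither branch can dominate every channel by construction; whether the published module normalizes its gates this way is not stated in the abstract.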