Suppr 超能文献


Object Detection Based on Swin Deformable Transformer-BiPAFPN-YOLOX.

Affiliation

School of Mechanical Engineering, Anhui Polytechnic University, Wuhu 241000, China.

Publication

Comput Intell Neurosci. 2023 Mar 9;2023:4228610. doi: 10.1155/2023/4228610. eCollection 2023.

DOI: 10.1155/2023/4228610
PMID: 36936669
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10019960/
Abstract

Object detection technology plays a crucial role in people's everyday lives, as well as in enterprise production and modern national defense. Most current object detection networks, such as YOLOX, employ convolutional neural networks rather than a Transformer as the backbone. However, these techniques lack a global understanding of the images and may lose meaningful information, such as the precise location of the most active feature detector. Recently, Transformers with larger receptive fields have shown superior performance to corresponding convolutional neural networks in computer vision tasks. The Transformer splits the image into patches and then feeds them to the Transformer in a sequence structure similar to word embeddings, enabling global modeling and a global understanding of entire images. However, simply using a Transformer with a larger receptive field raises several concerns. For example, the self-attention in the Swin Transformer backbone limits its ability to model long-range relations, resulting in poor feature extraction and slow convergence during training. To address these problems, we first propose an important-region-based Reconstructed Deformable Self-Attention that shifts attention to important regions for efficient global modeling. Second, based on the Reconstructed Deformable Self-Attention, we propose the Swin Deformable Transformer backbone, which improves feature extraction ability and convergence speed. Finally, based on the Swin Deformable Transformer backbone, we propose a novel object detection network, namely, Swin Deformable Transformer-BiPAFPN-YOLOX. Experimental results on the COCO dataset show that the training period is reduced by 55.4%, average precision is increased by 2.4%, average precision on small objects is increased by 3.7%, and inference speed is increased by 35%.
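The patch-splitting step described above (an image cut into patches that are fed to the Transformer as a token sequence, analogous to word embeddings) can be sketched in a few lines. The `patchify` helper below is an illustrative assumption, not code from the paper:

```python
import numpy as np

def patchify(image, patch_size):
    """Split an HxWxC image into non-overlapping, flattened patches.

    Returns an array of shape (num_patches, patch_size*patch_size*C),
    i.e. one token per patch, ready for a linear embedding layer.
    """
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image size must be divisible by patch size"
    # Carve the grid: (H//P, P, W//P, P, C) -> (H//P, W//P, P, P, C)
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each PxPxC patch into a single token vector.
    return patches.reshape(-1, P * P * C)

# A 224x224 RGB image with 4x4 patches yields (224/4)*(224/4) = 3136
# tokens, each of dimension 4*4*3 = 48.
img = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(img, 4)
print(tokens.shape)  # (3136, 48)
```

The same reshape-transpose-reshape pattern underlies patch embedding in ViT-style backbones; Swin additionally merges patches between stages, which is omitted here.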

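The paper's Reconstructed Deformable Self-Attention is not reproduced here, but the general deformable-attention idea it builds on (each query predicts sampling offsets and aggregates interpolated values from those locations, rather than attending to all positions) can be illustrated in one dimension. All names, weight shapes, and the single-head simplification below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def deformable_attention_1d(queries, values, W_off, W_attn, n_points=4):
    """Single-head, 1-D deformable attention sketch.

    Each query predicts n_points fractional offsets around its own
    position, samples values there by linear interpolation, and mixes
    them with learned attention weights -- so cost is O(L * n_points)
    instead of O(L^2) for full self-attention.
    """
    L, D = queries.shape
    ref = np.arange(L, dtype=np.float64)       # reference positions
    offsets = queries @ W_off                  # (L, n_points) predicted offsets
    attn = softmax(queries @ W_attn)           # (L, n_points) mixing weights
    pos = np.clip(ref[:, None] + offsets, 0, L - 1)  # sampling locations
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, L - 1)
    frac = pos - lo
    # Linear interpolation between the two nearest value vectors.
    sampled = (1 - frac)[..., None] * values[lo] + frac[..., None] * values[hi]
    return (attn[..., None] * sampled).sum(axis=1)  # (L, D)

L, D, P = 6, 8, 4
q = rng.normal(size=(L, D))
v = rng.normal(size=(L, D))
out = deformable_attention_1d(q, v, rng.normal(size=(D, P)), rng.normal(size=(D, P)), P)
print(out.shape)  # (6, 8)
```

In 2-D image attention the interpolation becomes bilinear over a feature map, but the mechanism — attending only at a few predicted "important" locations — is the same one the abstract credits for the improved convergence speed.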

Figures (PMC, duplicates removed)

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/60d2d7d9b19e/CIN2023-4228610.001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/e20d3e1d3af6/CIN2023-4228610.002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/d17c6bcbb88d/CIN2023-4228610.003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/d1cfea149284/CIN2023-4228610.004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/a24effa9e81b/CIN2023-4228610.005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/dfdfe294eb98/CIN2023-4228610.006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/cc5c18a2610b/CIN2023-4228610.007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/604565c5f7b2/CIN2023-4228610.008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/1a3c6951c17d/CIN2023-4228610.009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/5d4cafcc08a8/CIN2023-4228610.010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/e2b5f5ba367d/CIN2023-4228610.011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/a77b0e15c46f/CIN2023-4228610.012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/366110275218/CIN2023-4228610.013.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/53a3/10019960/afc3646e5cbe/CIN2023-4228610.014.jpg

Similar Articles

1. Object Detection Based on Swin Deformable Transformer-BiPAFPN-YOLOX.
Comput Intell Neurosci. 2023 Mar 9;2023:4228610. doi: 10.1155/2023/4228610. eCollection 2023.
2. Small object detection algorithm incorporating swin transformer for tea buds.
PLoS One. 2024 Mar 21;19(3):e0299902. doi: 10.1371/journal.pone.0299902. eCollection 2024.
3. A Swin Transformer-Based Model for Thyroid Nodule Detection in Ultrasound Images.
J Vis Exp. 2023 Apr 21;(194). doi: 10.3791/64480.
4. FEA-Swin: Foreground Enhancement Attention Swin Transformer Network for Accurate UAV-Based Dense Object Detection.
Sensors (Basel). 2022 Sep 15;22(18):6993. doi: 10.3390/s22186993.
5. Transformer-Based Model with Dynamic Attention Pyramid Head for Semantic Segmentation of VHR Remote Sensing Imagery.
Entropy (Basel). 2022 Nov 6;24(11):1619. doi: 10.3390/e24111619.
6. Swin-HSTPS: Research on Target Detection Algorithms for Multi-Source High-Resolution Remote Sensing Images.
Sensors (Basel). 2021 Dec 4;21(23):8113. doi: 10.3390/s21238113.
7. Enhancing medical image segmentation with a multi-transformer U-Net.
PeerJ. 2024 Feb 29;12:e17005. doi: 10.7717/peerj.17005. eCollection 2024.
8. Multiple Attention Mechanism Enhanced YOLOX for Remote Sensing Object Detection.
Sensors (Basel). 2023 Jan 22;23(3):1261. doi: 10.3390/s23031261.
9. Swin-Transformer-Based YOLOv5 for Small-Object Detection in Remote Sensing Images.
Sensors (Basel). 2023 Mar 31;23(7):3634. doi: 10.3390/s23073634.
10. Face-based age estimation using improved Swin Transformer with attention-based convolution.
Front Neurosci. 2023 Apr 12;17:1136934. doi: 10.3389/fnins.2023.1136934. eCollection 2023.

Cited By

1. Exploring graph-based models for predicting active compounds against triple-negative breast cancer.
Mol Divers. 2025 Jul 9. doi: 10.1007/s11030-025-11283-7.
2. XAI-driven CatBoost multi-layer perceptron neural network for analyzing breast cancer.
Sci Rep. 2024 Nov 19;14(1):28674. doi: 10.1038/s41598-024-79620-8.