Attention-Guided Disentangled Feature Aggregation for Video Object Detection.

Affiliations

Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany.

Mindgarage, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany.

Publication information

Sensors (Basel). 2022 Nov 7;22(21):8583. doi: 10.3390/s22218583.

Abstract

Object detection is a computer vision task that involves localisation and classification of objects in an image. Video data implicitly introduces several challenges, such as blur, occlusion and defocus, making video object detection more challenging in comparison to still image object detection, which is performed on individual and independent images. This paper tackles these challenges by proposing an attention-heavy framework for video object detection that aggregates the disentangled features extracted from individual frames. The proposed framework is a two-stage object detector based on the Faster R-CNN architecture. The disentanglement head integrates scale, spatial and task-aware attention and applies it to the features extracted by the backbone network across all the frames. Subsequently, the aggregation head incorporates temporal attention and improves detection in the target frame by aggregating the features of the support frames. These include the features extracted from the disentanglement network along with the temporal features. We evaluate the proposed framework using the ImageNet VID dataset and achieve a mean Average Precision (mAP) of 49.8 and 52.5 using the backbones of ResNet-50 and ResNet-101, respectively. The improvement in performance over the individual baseline methods validates the efficacy of the proposed approach.
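The aggregation head's core idea — weighting support-frame features by their similarity to the target frame and combining them via temporal attention — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function name `temporal_aggregate`, the feature shapes, and the scaled dot-product scoring are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_aggregate(target, supports):
    """Aggregate support-frame features into the target frame via
    similarity-weighted (softmax) temporal attention.

    target:   (d,)   feature vector of the target frame
    supports: (T, d) feature vectors of T support frames
    Returns an aggregated (d,) feature for the target frame.
    """
    # Scaled dot-product similarity between target and each support frame.
    scores = supports @ target / np.sqrt(target.shape[0])   # (T,)
    weights = softmax(scores)                               # (T,) sums to 1
    # Convex combination of support features, weighted by attention.
    return weights @ supports                               # (d,)

rng = np.random.default_rng(0)
target = rng.standard_normal(8)
supports = rng.standard_normal((4, 8))
agg = temporal_aggregate(target, supports)
```

Because the softmax weights form a convex combination, the aggregated feature stays within the per-dimension range of the support features; in the full framework these inputs would be the disentangled features produced by the scale-, spatial-, and task-aware attention head rather than random vectors.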

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3581/9658927/1afdd9633f26/sensors-22-08583-g001.jpg
