

Actor and Action Modular Network for Text-Based Video Segmentation.

Publication Information

IEEE Trans Image Process. 2022;31:4474-4489. doi: 10.1109/TIP.2022.3185487. Epub 2022 Jul 1.

DOI: 10.1109/TIP.2022.3185487
PMID: 35763476
Abstract

Text-based video segmentation aims to segment an actor in video sequences by specifying the actor and its performed action with a textual query. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner according to the actor and its action, due to the problem of semantic asymmetry: the two modalities contain different amounts of semantic information during the multi-modal fusion process. To alleviate this problem, we propose a novel actor and action modular network that individually localizes the actor and its action in two separate modules. Specifically, we first learn the actor-/action-related content from the video and textual query, and then match them in a symmetrical manner to localize the target tube. The target tube contains the desired actor and action and is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also establishes the association of objects across multiple frames with the proposed temporal proposal aggregation mechanism. This enables our method to segment the video effectively and keep the temporal consistency of predictions. The whole model allows joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance for both single-frame segmentation and full-video segmentation on the A2D Sentences and J-HMDB Sentences datasets.
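The core idea in the abstract — scoring candidate tubes by matching actor and action features against the query in two separate modules, then fusing the scores symmetrically — can be illustrated with a toy sketch. This is not the authors' implementation; the cosine-similarity scoring, product fusion, and all names (`match_tubes`, `actor_feat`, `action_feat`) are illustrative assumptions standing in for the paper's learned modules.

```python
# Hypothetical sketch of two-module actor/action matching over candidate
# tubes. In the paper these features come from learned video/text encoders;
# here they are plain vectors and the scoring is toy cosine similarity.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def match_tubes(actor_query, action_query, tubes):
    """Score each candidate tube with separate actor and action modules,
    fuse the two scores symmetrically (here, by product), and return the
    index of the best-matching tube."""
    scores = []
    for tube in tubes:
        actor_score = cosine(actor_query, tube["actor_feat"])
        action_score = cosine(action_query, tube["action_feat"])
        scores.append(actor_score * action_score)  # symmetric fusion
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: tube 0 matches the actor but not the action; tube 1
# matches both, so it is selected as the target tube.
tubes = [
    {"actor_feat": [1.0, 0.0], "action_feat": [0.0, 1.0]},
    {"actor_feat": [0.9, 0.1], "action_feat": [0.8, 0.2]},
]
best = match_tubes([1.0, 0.1], [1.0, 0.0], tubes)
print(best)  # 1
```

In the full model, the selected tube would then be passed to a fully convolutional network for mask prediction; the product fusion above just illustrates why a tube must satisfy both the actor and the action parts of the query to be chosen.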


Similar Articles

1. Actor and Action Modular Network for Text-Based Video Segmentation.
   IEEE Trans Image Process. 2022;31:4474-4489. doi: 10.1109/TIP.2022.3185487. Epub 2022 Jul 1.
2. Object-Agnostic Transformers for Video Referring Segmentation.
   IEEE Trans Image Process. 2022;31:2839-2849. doi: 10.1109/TIP.2022.3161832. Epub 2022 Apr 5.
3. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation.
   Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.
4. Referring Segmentation in Images and Videos With Cross-Modal Self-Attention Network.
   IEEE Trans Pattern Anal Mach Intell. 2022 Jul;44(7):3719-3732. doi: 10.1109/TPAMI.2021.3054384. Epub 2022 Jun 3.
5. Query-Adaptive Late Fusion for Hierarchical Fine-Grained Video-Text Retrieval.
   IEEE Trans Neural Netw Learn Syst. 2022 Oct 24;PP. doi: 10.1109/TNNLS.2022.3214208.
6. Segmentation in Weakly Labeled Videos via a Semantic Ranking and Optical Warping Network.
   IEEE Trans Image Process. 2018 May 16. doi: 10.1109/TIP.2018.2834221.
7. A cross-modal conditional mechanism based on attention for text-video retrieval.
   Math Biosci Eng. 2023 Nov 3;20(11):20073-20092. doi: 10.3934/mbe.2023889.
8. Video Object Segmentation without Temporal Information.
   IEEE Trans Pattern Anal Mach Intell. 2019 Jun;41(6):1515-1530. doi: 10.1109/TPAMI.2018.2838670. Epub 2018 May 23.
9. CLIP-Driven Prototype Network for Few-Shot Semantic Segmentation.
   Entropy (Basel). 2023 Sep 18;25(9):1353. doi: 10.3390/e25091353.
10. Text-Based Localization of Moments in a Video Corpus.
    IEEE Trans Image Process. 2021;30:8886-8899. doi: 10.1109/TIP.2021.3120038. Epub 2021 Oct 28.

Cited By

1. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation.
   Sensors (Basel). 2024 Aug 20;24(16):5375. doi: 10.3390/s24165375.