Ma Yiwei, Ji Jiayi, Sun Xiaoshuai, Zhou Yiyi, Hong Xiaopeng, Wu Yongjian, Ji Rongrong
IEEE Trans Neural Netw Learn Syst. 2025 Apr;36(4):6203-6217. doi: 10.1109/TNNLS.2024.3409354. Epub 2025 Apr 4.
This article explores a novel dynamic network for vision and language (V&L) tasks, where the inference structure is customized on the fly for different inputs. Most previous state-of-the-art (SOTA) approaches are static, handcrafted networks, which not only rely heavily on expert knowledge but also ignore the semantic diversity of input samples, thus resulting in suboptimal performance. To address these issues, we propose a novel Dynamic Transformer Network (DTNet) for image captioning, which dynamically assigns customized paths to different samples, leading to discriminative yet accurate captions. Specifically, to build a rich routing space and improve routing efficiency, we introduce five types of basic cells and group them into two separate routing spaces according to their operating domains, i.e., spatial and channel. Then, we design a Spatial-Channel Joint Router (SCJR), which endows the model with the capability of path customization based on both the spatial and channel information of the input sample. To validate the effectiveness of the proposed DTNet, we conduct extensive experiments on the MS-COCO dataset and achieve new SOTA performance on both the Karpathy split and the online test server. The source code is publicly available at https://github.com/xmu-xiaoma666/DTNet.
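The abstract describes per-sample path customization: a router pools the input along both the spatial and channel axes and uses the joint result to weight candidate cells. The paper's actual SCJR architecture is not specified here, so the following is only a minimal NumPy sketch under stated assumptions: mean pooling for both descriptors, hypothetical projection matrices `w_spatial` and `w_channel` producing path logits, and a soft (weighted-mixture) rather than hard path selection.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scjr_route(x, cells, w_spatial, w_channel):
    """Hypothetical sketch of a Spatial-Channel Joint Router.

    x          : (N, C) features of one sample (N spatial positions, C channels).
    cells      : list of K callables, the candidate cells in the routing space.
    w_spatial  : (C, K) projection from the channel descriptor to path logits.
    w_channel  : (N, K) projection from the spatial descriptor to path logits.
    All names and shapes are illustrative assumptions, not the paper's layout.
    """
    chan_desc = x.mean(axis=0)          # pool over space   -> (C,)
    spat_desc = x.mean(axis=1)          # pool over channels -> (N,)
    # Joint logits combine both views of the sample, then soft path weights.
    logits = chan_desc @ w_spatial + spat_desc @ w_channel   # (K,)
    weights = softmax(logits)                                # (K,)
    # A per-sample "custom path" as a weighted mixture of cell outputs.
    outputs = np.stack([cell(x) for cell in cells])          # (K, N, C)
    return np.tensordot(weights, outputs, axes=1)            # (N, C)
```

In a trained model the projections would be learned and the pooled descriptors would typically pass through a small MLP; the point of the sketch is only that routing decisions depend jointly on spatial and channel statistics of each input.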