

Channel-shuffled transformers for cross-modality person re-identification in video.

Author information

Kasantikul Rangwan, Kusakunniran Worapan, Wu Qiang, Wang Zhiyong

Affiliations

Faculty of Information and Communication Technology, Mahidol University, 999 Phuttamonthon 4 Road, Salaya, 73170, Nakhon Pathom, Thailand.

School of Computer Science, The University of Sydney, Camperdown, 2006, New South Wales, Australia.

Publication information

Sci Rep. 2025 Apr 29;15(1):15009. doi: 10.1038/s41598-025-00063-w.

Abstract

Effective implementation of person re-identification (Re-ID) across different modalities (such as daylight vs. night-vision) is crucial for surveillance applications. Information from multiple frames is essential for effective re-identification in conditions where the visual components of individual frames become less reliable. While transformers can enhance temporal information extraction, the large number of channels required for effective feature encoding introduces scaling challenges, which can lead to overfitting and instability during training. Therefore, we propose a novel Channel-Shuffled Temporal Transformer (CSTT) that processes multi-frame sequences in conjunction with a ResNet backbone, forming the Hybrid Channel-Shuffled Transformer Net (HCSTNET). Replacing the fully connected layers of standard multi-head attention with ShuffleNet-like structures is important for integrating transformer attention with a ResNet backbone: channel grouping reduces overfitting by cutting the number of parameters, and channel shuffling further improves the learned attention. In our tests on the SYSU-MM01 dataset, compared against simple averaging of multiple frames, only the temporal transformer with channel shuffling achieved a measurable improvement over the baseline. We also investigated the optimal partitioning of feature maps within this design.
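The core architectural change described above is to replace the dense Q/K/V projections of standard multi-head attention with ShuffleNet-like grouped projections followed by channel shuffling, applied over a sequence of per-frame ResNet features. The PyTorch sketch below illustrates that idea only; it is not the authors' implementation, and the class names (GroupedShuffleLinear, ChannelShuffledTemporalAttention) and hyper-parameters (2048-dimensional features, 8 heads, 4 channel groups, 6-frame clips) are illustrative assumptions.

# Minimal sketch (assumed names and settings, not the paper's code) of a
# temporal multi-head attention whose projections are grouped and shuffled,
# in the spirit of the CSTT described in the abstract.

import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups, as in ShuffleNet."""
    b, t, c = x.shape                       # (batch, frames, channels)
    x = x.view(b, t, groups, c // groups)   # split channels into groups
    return x.transpose(2, 3).reshape(b, t, c)  # transpose + flatten = shuffle


class GroupedShuffleLinear(nn.Module):
    """Grouped 1x1 projection + channel shuffle, replacing a dense nn.Linear."""

    def __init__(self, dim: int, groups: int):
        super().__init__()
        self.groups = groups
        # A grouped 1x1 convolution has dim*dim/groups weights instead of dim*dim.
        self.proj = nn.Conv1d(dim, dim, kernel_size=1, groups=groups, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) -> project along the channel axis -> shuffle
        y = self.proj(x.transpose(1, 2)).transpose(1, 2)
        return channel_shuffle(y, self.groups)


class ChannelShuffledTemporalAttention(nn.Module):
    """Multi-head self-attention over a sequence of per-frame feature vectors."""

    def __init__(self, dim: int = 2048, heads: int = 8, groups: int = 4):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.to_q = GroupedShuffleLinear(dim, groups)
        self.to_k = GroupedShuffleLinear(dim, groups)
        self.to_v = GroupedShuffleLinear(dim, groups)
        self.out = GroupedShuffleLinear(dim, groups)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # Split into heads: (batch, heads, frames, head_dim).
        q, k, v = (z.view(b, t, self.heads, c // self.heads).transpose(1, 2)
                   for z in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (b, heads, t, t)
        out = attn.softmax(dim=-1) @ v                  # weighted sum over frames
        out = out.transpose(1, 2).reshape(b, t, c)
        return self.out(out)


# Usage: per-frame ResNet features (assumed 2048-d) for clips of 6 frames.
frames = torch.randn(2, 6, 2048)
attn = ChannelShuffledTemporalAttention(dim=2048, heads=8, groups=4)
print(attn(frames).shape)  # torch.Size([2, 6, 2048])

With groups=4, each grouped projection uses roughly a quarter of the weights of a dense 2048x2048 layer, which corresponds to the parameter reduction the abstract attributes to channel grouping, while the shuffle restores information flow between the channel groups.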


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/747b/12041324/8bef9f9995ae/41598_2025_63_Fig1_HTML.jpg
