Li Xin, Lan Cuiling, Wei Guoqiang, Chen Zhibo
IEEE Trans Image Process. 2024;33:5340-5353. doi: 10.1109/TIP.2024.3437212. Epub 2024 Oct 2.
Vision transformers have demonstrated great potential in a wide range of vision tasks. However, they inevitably suffer from poor generalization when a distribution shift occurs at test time (i.e., on out-of-distribution data). To mitigate this issue, we propose a novel method, Semantic-aware Message Broadcasting (SAMB), which enables more informative and flexible feature alignment for unsupervised domain adaptation (UDA). In particular, we study the attention module of the vision transformer and observe that the alignment space built on a single global class token lacks flexibility: the class token exchanges information with all image tokens in the same manner and ignores the rich semantics of different regions. In this paper, we aim to enrich the alignment features by enabling semantic-aware adaptive message broadcasting. Specifically, we introduce a group of learnable group tokens as nodes that aggregate global information from all image tokens, while encouraging different group tokens to adaptively focus their message broadcasting on different semantic regions. In this way, the message broadcasting drives the group tokens to learn more informative and diverse representations for effective domain alignment. Moreover, we systematically study the effects of adversarial-based feature alignment (ADA) and pseudo-label-based self-training (PST) on UDA. We find that a simple two-stage training strategy combining ADA and PST can further improve the adaptation capability of the vision transformer. Extensive experiments on DomainNet, OfficeHome, and VisDA-2017 demonstrate the effectiveness of our methods for UDA.
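The sketch below is a minimal, illustrative rendering of the group-token idea described in the abstract, not the authors' implementation: learnable group tokens aggregate information from all image tokens via cross-attention, and each image token then receives messages back from the group tokens with its own adaptive weighting, so different groups can specialize on different semantic regions. The module name, layer layout, and hyperparameters are assumptions for illustration only.

```python
# Hypothetical sketch of semantic-aware message broadcasting with group tokens.
# Not the authors' code; names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class SemanticAwareMessageBroadcast(nn.Module):
    def __init__(self, dim: int = 768, num_groups: int = 8):
        super().__init__()
        # A group of learnable tokens that act as aggregation nodes.
        self.group_tokens = nn.Parameter(torch.randn(1, num_groups, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, image_tokens):
        # image_tokens: (B, N, C) patch embeddings from a vision transformer backbone.
        B, N, C = image_tokens.shape
        groups = self.group_tokens.expand(B, -1, -1)              # (B, G, C)

        # 1) Aggregation: group tokens query all image tokens (cross-attention).
        q = self.to_q(groups)                                      # (B, G, C)
        k = self.to_k(image_tokens)                                # (B, N, C)
        v = self.to_v(image_tokens)                                # (B, N, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale              # (B, G, N)
        attn = attn.softmax(dim=-1)
        groups = groups + attn @ v                                 # updated group tokens

        # 2) Broadcasting: each image token attends over the group tokens, so
        #    different semantic regions can listen to different groups.
        back = (self.to_q(image_tokens) @ self.to_k(groups).transpose(-2, -1)) * self.scale
        back = back.softmax(dim=-1)                                # (B, N, G)
        image_tokens = image_tokens + self.proj(back @ self.to_v(groups))
        return image_tokens, groups                                # group tokens can feed domain alignment


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)                                   # e.g. ViT-B/16 patch tokens
    out, grp = SemanticAwareMessageBroadcast()(x)
    print(out.shape, grp.shape)                                    # (2, 196, 768) and (2, 8, 768)
```

In such a setup, the updated group tokens (rather than a single class token) would serve as the features on which adversarial domain alignment operates, which is the flexibility the abstract argues a single global class token lacks.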