Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, P. R. China.
Int J Neural Syst. 2023 Jul;33(7):2350035. doi: 10.1142/S0129065723500351. Epub 2023 Jun 14.
Zero-shot detection (ZSD) aims to locate and classify unseen objects in images or videos using auxiliary semantic information, without additional training examples. Most existing ZSD methods are based on two-stage models, which detect unseen classes by aligning object region proposals with semantic embeddings. However, these methods have several limitations, including poor region proposals for unseen classes, failure to consider the semantic representations of unseen classes or their inter-class correlations, and domain bias towards seen classes, all of which can degrade overall performance. To address these issues, the Trans-ZSD framework is proposed: a transformer-based multi-scale contextual detection framework that explicitly exploits inter-class correlations between seen and unseen classes and optimizes the feature distribution to learn discriminative features. Trans-ZSD is a single-stage approach that skips proposal generation and performs detection directly, encoding long-range dependencies at multiple scales to learn contextual features while requiring fewer inductive biases. Trans-ZSD also introduces a foreground-background separation branch to alleviate confusion between unseen classes and background, contrastive learning to capture inter-class uniqueness and reduce misclassification between similar classes, and explicit inter-class commonality learning to facilitate generalization between related classes. Trans-ZSD further addresses the domain bias problem in end-to-end generalized zero-shot detection (GZSD) models by using a balance loss to maximize response consistency between seen and unseen predictions, preventing the model from biasing towards seen classes. The Trans-ZSD framework is evaluated on the PASCAL VOC and MS COCO datasets, demonstrating significant improvements over existing ZSD models.
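To illustrate the idea behind the balance loss described above, the following is a minimal, hypothetical sketch in PyTorch. The function name `balance_loss`, the L1 gap between peak seen and unseen responses, and the index tensors are assumptions for illustration only; the paper's actual formulation may differ.

```python
import torch


def balance_loss(logits: torch.Tensor,
                 seen_idx: torch.Tensor,
                 unseen_idx: torch.Tensor) -> torch.Tensor:
    """Hypothetical balance loss: encourage comparable peak responses
    for seen and unseen classes so predictions are not biased towards
    seen classes (a sketch, not the paper's exact formulation)."""
    # Convert raw class logits to probabilities over all classes.
    probs = logits.softmax(dim=-1)
    # Strongest response among seen classes and among unseen classes,
    # computed per predicted box.
    seen_peak = probs[:, seen_idx].max(dim=-1).values
    unseen_peak = probs[:, unseen_idx].max(dim=-1).values
    # Penalize the discrepancy between the two peaks (L1 gap).
    return (seen_peak - unseen_peak).abs().mean()


if __name__ == "__main__":
    # Toy example: 4 predicted boxes, 10 classes (7 seen, 3 unseen).
    logits = torch.randn(4, 10)
    seen_idx = torch.arange(0, 7)
    unseen_idx = torch.arange(7, 10)
    print(balance_loss(logits, seen_idx, unseen_idx))
```

In this sketch, minimizing the loss pulls the strongest seen-class and unseen-class responses towards each other, which is one simple way to realize the "response consistency" objective mentioned in the abstract.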