Wang Yujiang, Dong Mingzhi, Shen Jie, Luo Yiming, Lin Yiming, Ma Pingchuan, Petridis Stavros, Pantic Maja
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12944-12959. doi: 10.1109/TPAMI.2023.3243812. Epub 2023 Oct 3.
This article presents a novel method for face clustering in videos using a video-centralised transformer. Previous works often employed contrastive learning to learn frame-level representations and used average pooling to aggregate the features along the temporal dimension, an approach that may not fully capture complicated video dynamics. In addition, despite recent progress in video-based contrastive learning, few works have attempted to learn a self-supervised, clustering-friendly face representation that benefits the video face clustering task. To overcome these limitations, our method employs a transformer to directly learn video-level representations that better reflect the temporally varying properties of faces in videos, and we propose a video-centralised self-supervised framework to train the transformer model. We also investigate face clustering in egocentric videos, a fast-emerging field that has not yet been studied in work on face clustering. To this end, we present and release the first large-scale egocentric video face clustering dataset, named EasyCom-Clustering. We evaluate the proposed method on both the widely used Big Bang Theory (BBT) dataset and the new EasyCom-Clustering dataset. Results show that our video-centralised transformer surpasses all previous state-of-the-art methods on both benchmarks, exhibiting a self-attentive understanding of face videos.
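The contrast the abstract draws between average pooling and attention-based aggregation of frame-level features can be illustrated with a minimal NumPy sketch. This is not the authors' architecture or training framework: the feature sizes, the random projection matrices, and the function names `mean_pool` and `self_attention_pool` are all assumptions made for illustration only.

```python
import numpy as np


def mean_pool(frames):
    """Average frame-level features of shape (T, D) into one vector of shape (D,)."""
    return frames.mean(axis=0)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_pool(frames, w_q, w_k, w_v):
    """Single-head self-attention over frames, followed by averaging.

    Each frame attends to every other frame, so the pooled vector depends on
    relations between frames rather than on each frame in isolation.
    """
    q, k, v = frames @ w_q, frames @ w_k, frames @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (T, T) pairwise frame affinities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    attended = weights @ v                   # (T, D) context-aware frame features
    return attended.mean(axis=0), weights


# Hypothetical sizes and random weights (assumptions, not values from the paper).
rng = np.random.default_rng(0)
T, D = 8, 16
frames = rng.normal(size=(T, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

video_mean = mean_pool(frames)
video_attn, attn_weights = self_attention_pool(frames, w_q, w_k, w_v)
```

Both functions return a single video-level vector of the same size. The difference is that the attention variant weights frames by their learned pairwise affinities before aggregation, whereas average pooling treats every frame identically.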