Honari Sina, Constantin Victor, Rhodin Helge, Salzmann Mathieu, Fua Pascal
IEEE Trans Pattern Anal Mach Intell. 2023 May;45(5):6415-6427. doi: 10.1109/TPAMI.2022.3215307. Epub 2023 Apr 3.
In this article, we propose an unsupervised feature extraction method to capture temporal information in monocular videos: we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally distant ones as negative pairs, as other CSS approaches do, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying the contrastive loss only to the time-variant features, and encouraging a gradual transition on them between nearby and distant frames while also reconstructing the input, extracts rich temporal features well suited to human pose estimation. Our approach reduces error by about 50% compared to standard CSS strategies, outperforms other unsupervised single-view methods, and matches the performance of multi-view techniques. When 2D pose is available, our approach can extract even richer latent features and improve 3D pose estimation accuracy, outperforming other state-of-the-art weakly supervised methods.
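The loss structure described above can be illustrated with a minimal PyTorch sketch, assuming a latent vector whose first `split` dimensions are the time-variant component. The InfoNCE-style form, the single-negative setup, the split point, and the MSE reconstruction term are assumptions for illustration, not the authors' exact formulation; the paper's gradual-transition term between nearby and distant frames is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def disentangled_css_loss(z_anchor, z_pos, z_neg, x, x_recon,
                          split=128, temperature=0.1, recon_weight=1.0):
    """Hypothetical sketch: contrastive loss on the time-variant part
    of each latent vector, plus reconstruction of the input from the
    full latent. z_* are (B, D) latents; x, x_recon are input/output
    of the autoencoder.
    """
    # Time-variant components only (first `split` dims, an assumption).
    v_a = z_anchor[:, :split]
    v_p = z_pos[:, :split]
    v_n = z_neg[:, :split]

    # Cosine similarities between the anchor frame and a temporally
    # nearby (positive) vs. distant (negative) frame.
    sim_pos = F.cosine_similarity(v_a, v_p, dim=-1) / temperature
    sim_neg = F.cosine_similarity(v_a, v_n, dim=-1) / temperature

    # InfoNCE with a single negative per anchor (simplified): the
    # positive similarity sits at index 0 of the logits.
    logits = torch.stack([sim_pos, sim_neg], dim=-1)  # (B, 2)
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)
    contrastive = F.cross_entropy(logits, labels)

    # Reconstruction keeps the full latent (time-variant plus
    # time-invariant) informative about the input.
    recon = F.mse_loss(x_recon, x)

    return contrastive + recon_weight * recon
```

Note that, unlike standard CSS where the whole latent is contrasted, only the first `split` dimensions enter the contrastive term here; the time-invariant remainder is constrained solely through reconstruction.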