Geometry-driven self-supervision for 3D human pose estimation.

Affiliation

Department of Artificial Intelligence, Korea University, Anam-ro 145, Seongbuk-gu, Seoul, Republic of Korea.

Publication information

Neural Netw. 2024 Jun;174:106237. doi: 10.1016/j.neunet.2024.106237. Epub 2024 Mar 14.

Abstract

Although 3D human pose estimation has recently advanced considerably, precisely reconstructing a 3D human pose from a single image without 3D annotations remains difficult, for two reasons. First, reconstruction is inherently ambiguous, because multiple 3D poses can project onto the same 2D pose. Second, accurately estimating camera rotation without laborious camera calibration is difficult. Some approaches address these issues with traditional computer vision algorithms, but these are not differentiable and cannot be optimized through training. This paper introduces two modules that explicitly leverage geometry to overcome these challenges without requiring any 3D ground truth or camera parameters. The first, the relative depth estimation module, mitigates depth ambiguity by narrowing the possible depths of each joint to only two candidates. The second, the differentiable pose alignment module, computes camera rotation by aligning poses from different views. These geometrically interpretable modules reduce the complexity of training and improve performance. The proposed method achieves state-of-the-art results on standard benchmark datasets, surpassing other self-supervised methods and even outperforming several fully supervised approaches that rely heavily on 3D annotations.
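The two geometric ideas named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract gives no formulas, so the sketch assumes an orthographic projection and a known bone length for the two-candidate depth step, and it uses the classical Kabsch (orthogonal Procrustes) solution via SVD for the alignment step. The function names relative_depth_candidates and align_rotation are hypothetical. The same SVD steps could in principle be written in an automatic-differentiation framework so that gradients flow through the alignment, which is what "differentiable" would require.

```python
import numpy as np


def relative_depth_candidates(parent_2d, child_2d, bone_length):
    """Return the two possible depth offsets (child minus parent) for one bone.

    Assumption: orthographic projection, so the 2D offset equals the x/y part
    of the 3D bone vector and only the sign of the depth offset is unknown.
    """
    dx, dy = np.asarray(child_2d, float) - np.asarray(parent_2d, float)
    # Clamp at zero in case noisy 2D joints make the projection longer than the bone.
    dz = np.sqrt(max(bone_length**2 - dx**2 - dy**2, 0.0))
    return dz, -dz


def align_rotation(pose_a, pose_b):
    """Rotation R (3x3) that best maps centred pose_a onto centred pose_b.

    Both poses are (num_joints, 3) arrays. This is the Kabsch solution,
    used here as a stand-in for the paper's alignment module.
    """
    a = pose_a - pose_a.mean(axis=0)
    b = pose_b - pose_b.mean(axis=0)
    h = a.T @ b                          # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T
```

For example, under these assumptions a bone of length 5 whose 2D projection has length 3 yields the depth candidates +4 and -4, and the network only has to choose between them. Applying align_rotation to the same skeleton predicted from two views returns the relative rotation between the views without any calibration data.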
