IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12798-12815. doi: 10.1109/TPAMI.2022.3216899. Epub 2023 Oct 3.
Training state-of-the-art models for human pose estimation in videos requires datasets with annotations that are difficult and expensive to obtain. Although transformers have recently been used for body pose sequence modeling, related methods rely on pseudo-ground truth to augment the currently limited training data available for learning such models. In this paper, we introduce PoseBERT, a transformer module fully trained on 3D Motion Capture (MoCap) data via masked modeling. It is simple, generic, and versatile: it can be plugged on top of any image-based model to turn it into a video-based model that leverages temporal information. We showcase variants of PoseBERT with different inputs, ranging from 3D skeleton keypoints to rotations of a 3D parametric model, for either the full body (SMPL) or just the hands (MANO). Since PoseBERT's training is task-agnostic, the model can be applied to several tasks, such as pose refinement, future pose prediction, or motion completion, without fine-tuning. Our experimental results validate that adding PoseBERT on top of various state-of-the-art pose estimation methods consistently improves their performance, while its low computational cost allows us to use it in a real-time demo for smoothly animating a robotic hand from a webcam. Test code and models are available at https://github.com/naver/posebert.
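For intuition, the sketch below illustrates the masked-modeling recipe described in the abstract on pose sequences: a transformer encoder takes a window of MoCap poses (e.g., flattened SMPL axis-angle rotations), random frames are replaced by a learned mask token, and the model is trained to reconstruct the hidden poses from temporal context. All class names, dimensions, and hyperparameters here are illustrative assumptions, not the authors' implementation; see the released code at https://github.com/naver/posebert for the actual model.

```python
# Minimal sketch of masked modeling on pose sequences, in the spirit of
# PoseBERT. Names, dimensions, and hyperparameters are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class MaskedPoseTransformer(nn.Module):
    def __init__(self, pose_dim=72, d_model=256, n_heads=8, n_layers=4, max_len=64):
        # pose_dim=72 assumes SMPL-style poses: 24 joints x 3 axis-angle values.
        super().__init__()
        self.embed = nn.Linear(pose_dim, d_model)               # one token per frame
        self.mask_token = nn.Parameter(torch.zeros(d_model))    # learned [MASK] embedding
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))  # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, pose_dim)                # token -> reconstructed pose

    def forward(self, poses, mask):
        # poses: (B, T, pose_dim) MoCap sequence; mask: (B, T) bool, True = hidden frame.
        tok = self.embed(poses)
        tok = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tok), tok)
        tok = tok + self.pos[: tok.size(1)]
        return self.head(self.encoder(tok))                     # predict a pose at every frame

# One training step: hide ~15% of the frames and reconstruct them from context.
model = MaskedPoseTransformer()
poses = torch.randn(2, 64, 72)            # stand-in for a batch of MoCap windows
mask = torch.rand(2, 64) < 0.15           # random frames to mask
loss = ((model(poses, mask) - poses) ** 2)[mask].mean()  # loss on masked frames only
loss.backward()
```

Under this view, the task-agnostic claim is natural: pose refinement, future pose prediction, and motion completion all amount to choosing which frames to mask at inference time (none, the trailing frames, or an interior gap), so a single masked-modeling objective covers all of them without fine-tuning.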