Lie Wen-Nung, Vann Veasna
Department of Electrical Engineering, Center for Innovative Research on Aging Society (CIRAS), Advanced Institute of Manufacturing with High-Tech Innovations (AIM-HI), National Chung Cheng University, Chia-Yi 621, Taiwan.
Sensors (Basel). 2024 Dec 15;24(24):8017. doi: 10.3390/s24248017.
In computer vision, accurately estimating a 3D human skeleton from a single RGB image remains a challenging task. Inspired by the advantages of multi-view approaches, we propose a method that predicts enhanced 2D skeletons (specifically, the joints' relative depths) from multiple virtual viewpoints based on a single real-view image. By fusing these virtual-viewpoint skeletons, we can then estimate the final 3D human skeleton more accurately. Our network consists of two stages. The first stage is a two-stream network: the Real-Net stream predicts the 2D image coordinates and the relative depth of each joint from the real viewpoint, while the Virtual-Net stream estimates the relative depths of the same joints from the virtual viewpoints. The second stage consists of a depth-denoising module, a cropped-to-original coordinate transform (COCT) module, and a fusion module. The fusion module combines the skeleton information from the real and virtual viewpoints so that it can undergo feature embedding, 2D-to-3D lifting, and regression to an accurate 3D skeleton. The experimental results demonstrate that our single-view method achieves an average per-joint position error of 45.7 mm, which is superior to several prior studies of the same kind and comparable to sequence-based methods that accept tens of consecutive frames as input.
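To make the described data flow concrete, below is a minimal PyTorch-style sketch of the two-stage pipeline. It is an illustration of the order of operations named in the abstract, not the authors' implementation: the joint count, the number of virtual viewpoints, the feature dimension, every module's internal structure, the placement of COCT, and the plain affine form of the COCT mapping are all assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

NUM_JOINTS = 17        # assumed skeleton size; not specified in the abstract
NUM_VIRTUAL_VIEWS = 3  # assumed number of virtual viewpoints

class RealNet(nn.Module):
    """Stage 1, real-view stream: predicts 2D image coordinates (x, y)
    plus a relative depth for each joint from pooled image features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.head = nn.Linear(feat_dim, NUM_JOINTS * 3)

    def forward(self, feat):                          # feat: (B, feat_dim)
        return self.head(feat).view(-1, NUM_JOINTS, 3)

class VirtualNet(nn.Module):
    """Stage 1, virtual-view stream: predicts per-joint relative depths
    as they would appear from each virtual viewpoint."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.head = nn.Linear(feat_dim, NUM_VIRTUAL_VIEWS * NUM_JOINTS)

    def forward(self, feat):
        return self.head(feat).view(-1, NUM_VIRTUAL_VIEWS, NUM_JOINTS)

def coct(joints_2d, bbox):
    """Cropped-to-original coordinate transform (COCT): maps joint
    coordinates from the cropped person patch back to the original image
    frame. A plain affine mapping is assumed; joints_2d is taken to be
    normalized to [0, 1] within the crop bbox = (x0, y0, w, h)."""
    x0, y0, w, h = bbox
    return joints_2d * joints_2d.new_tensor([w, h]) + joints_2d.new_tensor([x0, y0])

class Stage2(nn.Module):
    """Stage 2: denoise virtual-view depths, fuse real- and virtual-view
    skeleton cues, then embed and lift/regress to a 3D skeleton."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.denoise = nn.Sequential(                 # depth-denoising module
            nn.Linear(NUM_JOINTS, NUM_JOINTS), nn.ReLU(),
            nn.Linear(NUM_JOINTS, NUM_JOINTS))
        fused_dim = NUM_JOINTS * 3 + NUM_VIRTUAL_VIEWS * NUM_JOINTS
        self.embed = nn.Linear(fused_dim, embed_dim)  # feature embedding
        self.lift = nn.Sequential(                    # 2D-to-3D lifting / regression
            nn.ReLU(), nn.Linear(embed_dim, NUM_JOINTS * 3))

    def forward(self, real_out, virt_depths):
        virt_depths = self.denoise(virt_depths)
        fused = torch.cat([real_out.flatten(1),       # fusion of both viewpoints
                           virt_depths.flatten(1)], dim=1)
        return self.lift(self.embed(fused)).view(-1, NUM_JOINTS, 3)

def mpjpe(pred, gt):
    """Mean per-joint position error, the metric behind the 45.7 mm figure."""
    return torch.linalg.norm(pred - gt, dim=-1).mean()

# Shape check with random features standing in for a backbone's output.
feat = torch.randn(2, 256)
pose_3d = Stage2()(RealNet()(feat), VirtualNet()(feat))   # (2, 17, 3)
```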