VPN++: Rethinking Video-Pose Embeddings for Understanding Activities of Daily Living.

Publication information

IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9703-9717. doi: 10.1109/TPAMI.2021.3127885. Epub 2022 Nov 7.

DOI:10.1109/TPAMI.2021.3127885

Abstract

Many attempts have been made to combine RGB and 3D poses for the recognition of Activities of Daily Living (ADL). ADL may look very similar, and distinguishing them often requires modeling fine-grained details. Because recent 3D ConvNets are too rigid to capture the subtle visual patterns across an action, this research direction is dominated by methods combining RGB and 3D poses. But the cost of computing 3D poses from an RGB stream is high in the absence of appropriate sensors. This limits the use of the aforementioned approaches in real-world applications requiring low latency. How, then, can 3D poses best be exploited for recognizing ADL? To this end, we propose an extension of a pose-driven attention mechanism, the Video-Pose Network (VPN), exploring two distinct directions. One is to transfer the pose knowledge into RGB through a feature-level distillation, and the other is to mimic pose-driven attention through an attention-level distillation. Finally, these two approaches are integrated into a single model, which we call VPN++. It is worth noting that VPN++ exploits the pose embeddings at training time via distillation but not at inference. We show that VPN++ is not only effective but also provides a substantial speed-up and high resilience to noisy poses. VPN++, with or without 3D poses, outperforms the representative baselines on 4 public datasets. Code is available at https://github.com/srijandas07/vpnplusplus.
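
The two distillation directions named in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss functions, so everything below is an assumption made for illustration: the feature-level term is taken to be a mean squared error between RGB (student) and pose (teacher) features, the attention-level term a KL divergence between softmax-normalized attention scores, and the names `distillation_loss`, `w_feat`, and `w_attn` are hypothetical. It is not the implementation in the authors' repository.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(cls_loss, rgb_feat, pose_feat,
                      rgb_attn_scores, pose_attn_scores,
                      w_feat=1.0, w_attn=1.0):
    """Combine a classification loss with two distillation terms.

    Assumed forms (not taken from the paper):
      - feature-level: MSE pulling RGB features toward pose features
      - attention-level: KL pulling RGB attention toward pose-driven attention
    """
    feat_term = mse(rgb_feat, pose_feat)
    attn_term = kl_divergence(softmax(pose_attn_scores),
                              softmax(rgb_attn_scores))
    return cls_loss + w_feat * feat_term + w_attn * attn_term

# Hypothetical toy values: the pose branch is only needed to compute this
# training loss, which is why it can be dropped at inference.
loss = distillation_loss(
    cls_loss=0.5,
    rgb_feat=[0.2, 0.4, 0.1],
    pose_feat=[0.3, 0.4, 0.0],
    rgb_attn_scores=[1.0, 0.5, 0.2],
    pose_attn_scores=[1.2, 0.3, 0.2],
)
print(loss)
```

In this sketch both distillation terms vanish when the RGB branch reproduces the pose branch exactly, so the total reduces to the classification loss alone.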
