Zhang Jiajun, Zhang Yuxiang, Zhang Hongwen, Zhou Xiao, Zhou Boyao, Shao Ruizhi, Hu Zonghai, Liu Yebin
IEEE Trans Pattern Anal Mach Intell. 2025 Nov;47(11):9655-9672. doi: 10.1109/TPAMI.2025.3588268.
Accurately modeling detailed interactions between humans/hands and objects is an appealing yet challenging task. Current multi-view capture systems can only reconstruct multiple subjects into a single, unified mesh, which fails to model the state of each instance individually during interactions. To address this, previous methods use template-based representations to track the human/hand and the object. However, the quality of the reconstructions is limited by the descriptive capability of the templates, so these methods inherently struggle with geometric details, pressure-induced deformations, and invisible contact surfaces. In this work, we propose an end-to-end Instance-aware Human-Object Interactions recovery (Ins-HOI) framework by introducing an instance-level occupancy field representation. However, real-captured data is available only as a holistic mesh and thus cannot provide instance-level supervision. To address this, we further propose a complementary training strategy that leverages synthetic data to introduce instance-level shape priors, enabling the disentanglement of the occupancy fields of different instances. Specifically, synthetic data, created by randomly combining individual scans of humans/hands and objects, guides the network to learn a coarse prior for each instance, while real-captured data helps in learning the overall geometry and restricting interpenetration in contact areas. As demonstrated in our experiments, Ins-HOI supports instance-level reconstruction and produces reasonable, realistic invisible contact surfaces even in cases of extremely close interaction. To facilitate research on this task, we collected a large-scale, high-fidelity 3D scan dataset containing 5.2k high-quality scans of real-world human-chair and hand-object interactions. The code and data will be made publicly available for research purposes.
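As a concrete illustration of the complementary training strategy described above, the sketch below shows one plausible way to supervise a two-channel instance occupancy network: synthetic batches carry per-instance occupancy labels, while real batches supervise only the union of the instance fields and add an interpenetration penalty at points claimed by more than one instance. This is a minimal sketch under stated assumptions, not the authors' implementation; all names (InstanceOccNet, lambda_pen, the MLP sizes) are illustrative.

```python
# Hedged sketch of complementary supervision for instance-level occupancy fields.
# Assumption: the network predicts one occupancy value per instance (e.g., human,
# object) at each query point, conditioned on a per-point feature vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceOccNet(nn.Module):
    """Maps a 3D query point plus a feature vector to K instance occupancies."""
    def __init__(self, feat_dim=256, num_instances=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_instances),  # one logit per instance
        )

    def forward(self, points, feats):
        # points: (B, N, 3), feats: (B, N, feat_dim) -> (B, N, K) occupancies
        return torch.sigmoid(self.mlp(torch.cat([points, feats], dim=-1)))

def synthetic_loss(pred_occ, gt_inst_occ):
    # Synthetic scans are composed instance by instance, so each occupancy
    # channel has its own ground-truth label: direct per-instance supervision.
    return F.binary_cross_entropy(pred_occ, gt_inst_occ)

def real_loss(pred_occ, gt_holistic_occ, lambda_pen=0.1):
    # Real scans give only a fused mesh: supervise the union of the instance
    # fields for overall geometry, and penalize points occupied by every
    # instance at once so contact regions partition space rather than overlap.
    union = pred_occ.max(dim=-1).values                # (B, N)
    recon = F.binary_cross_entropy(union, gt_holistic_occ)
    penetration = pred_occ.prod(dim=-1).mean()         # high only when channels overlap
    return recon + lambda_pen * penetration
```

In this reading, the synthetic term injects the instance-level shape prior the holistic meshes lack, while the real term anchors the overall geometry and discourages interpenetration at (possibly invisible) contact surfaces; the balance between the two is a design choice left open by the abstract.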