Zhang Dewen, Hussain Tahir, An Wangpeng, Shouno Hayaru
Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo 182-8585, Japan.
TikTok Inc., 1199 Coleman Ave, San Jose, CA 95110, USA.
Sensors (Basel). 2025 Aug 21;25(16):5213. doi: 10.3390/s25165213.
Current vision-language models (VLMs) are well suited to general visual understanding tasks. However, they perform poorly on complex visual tasks involving human poses and actions, owing to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset of 200,328 samples tailored to fine-tuning models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model on this dataset and evaluate the resulting LLaVA-Pose model on the benchmark, where it achieves an overall improvement of 33.2% over the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding.
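To make the data-generation idea concrete, the sketch below shows one plausible way to fold COCO-style person keypoints, a bounding box, and a caption into a single text-only context that a language model could use to produce instruction-following samples. This is a minimal illustration assuming the standard 17-keypoint COCO annotation format; the function name build_person_context and the prompt wording are hypothetical and not taken from the paper's released code.

```python
# Hypothetical sketch: assembling a keypoint-augmented textual context from
# COCO-style annotations. Identifiers and prompt wording are illustrative
# assumptions, not the authors' implementation.

from typing import List

# The 17 COCO person keypoints, in annotation order.
COCO_KEYPOINT_NAMES: List[str] = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def build_person_context(caption: str, bbox: List[float], keypoints: List[float]) -> str:
    """Turn one caption plus one person annotation into a plain-text context block.

    bbox is [x, y, width, height]; keypoints is a flat [x1, y1, v1, ...] list
    where v=0 means unlabeled, v=1 occluded, v=2 visible (COCO convention).
    """
    lines = [f"Caption: {caption}",
             f"Person bounding box (x, y, w, h): {bbox}"]
    for name, i in zip(COCO_KEYPOINT_NAMES, range(0, len(keypoints), 3)):
        x, y, v = keypoints[i], keypoints[i + 1], keypoints[i + 2]
        if v > 0:  # skip joints that were not labeled
            lines.append(f"Keypoint {name}: ({x:.0f}, {y:.0f}), "
                         f"{'visible' if v == 2 else 'occluded'}")
    return "\n".join(lines)

if __name__ == "__main__":
    context = build_person_context(
        caption="A tennis player lunges to return a low backhand.",
        bbox=[210.0, 95.0, 180.0, 320.0],
        keypoints=[300, 120, 2, 0, 0, 0, 290, 118, 2] + [0, 0, 0] * 14,
    )
    # In a pipeline like the one described in the abstract, this text-only
    # context would be paired with instruction templates to generate the
    # conversation, detailed description, and complex reasoning samples.
    print(context)
```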