
LLaVA-Pose: Keypoint-Integrated Instruction Tuning for Human Pose and Action Understanding.

Authors

Zhang Dewen, Hussain Tahir, An Wangpeng, Shouno Hayaru

Affiliations

Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo 182-8585, Japan.

TikTok Inc., 1199 Coleman Ave, San Jose, CA 95110, USA.

Publication

Sensors (Basel). 2025 Aug 21;25(16):5213. doi: 10.3390/s25165213.

DOI:10.3390/s25165213
PMID:40872075
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12390531/
Abstract

Current vision-language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 33.2% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding.
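The core data-generation idea described above, folding each person's keypoint coordinates into the textual context alongside the caption and bounding boxes, can be sketched as follows. This is a minimal illustration assuming COCO-style 17-keypoint annotations; the function name and data layout are hypothetical, not taken from the paper's code.

```python
# Sketch: turn one image's annotations (caption, person boxes, COCO-style
# keypoints) into a textual context that a language model could be prompted
# with to generate instruction-following Q&A about human poses.
# All names here are illustrative, not from the LLaVA-Pose codebase.

COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def build_context(caption, people):
    """people: list of dicts with 'bbox' as [x, y, w, h] and 'keypoints'
    as a flat [x1, y1, v1, x2, y2, v2, ...] list (COCO format, where
    v > 0 means the keypoint is labeled)."""
    lines = [f"Caption: {caption}"]
    for i, person in enumerate(people, start=1):
        x, y, w, h = person["bbox"]
        lines.append(f"Person {i} bounding box: [{x}, {y}, {w}, {h}]")
        kps = person["keypoints"]
        for j, name in enumerate(COCO_KEYPOINT_NAMES):
            px, py, v = kps[3 * j : 3 * j + 3]
            if v > 0:  # keep only labeled keypoints
                lines.append(f"Person {i} {name}: ({px}, {py})")
    return "\n".join(lines)
```

A context string like this would then be paired with prompts asking for conversations, detailed descriptions, or complex reasoning about the people in the scene, yielding the three sample types the abstract mentions.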


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aca7/12390531/b7891b00bbed/sensors-25-05213-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/aca7/12390531/34834c365eac/sensors-25-05213-g002.jpg

