
Social Reasoning-Aware Trajectory Prediction via Multimodal Language Model.

Author information

Bae Inhwan, Lee Junoh, Jeon Hae-Gon

Publication information

IEEE Trans Pattern Anal Mach Intell. 2025 Jun 20;PP. doi: 10.1109/TPAMI.2025.3582000.

DOI:10.1109/TPAMI.2025.3582000
PMID:40540377
Abstract

Recent advances in language models have demonstrated their capacity for context understanding and generative representation. Building on these developments, we propose a novel multimodal trajectory predictor based on a vision-language model, named VLMTraj, which takes full advantage of the prior knowledge of multimodal large language models and their human-like reasoning across diverse modalities. The key idea of our model is to reframe the trajectory prediction task in a visual question answering format, using historical information as context and instructing the language model to make predictions in a conversational manner. Specifically, we transform all inputs into a natural language style: historical trajectories are converted into text prompts, and scene images are described through image captioning. Additionally, visual features from the input images are transformed into tokens via a modality encoder and connector. The transformed data is then formatted for use in a language model. Next, to guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task questions and answers. For training, we first optimize a numerical tokenizer on the prompt data so that it effectively separates the integer and decimal parts of numbers, allowing the language model to capture correlations between consecutive numbers. We then train our language model using all the visual question answering prompts. During inference, we implement both deterministic and stochastic prediction: beam-search-based most-likely prediction and temperature-based multimodal generation. VLMTraj validates that a language-based model can be a powerful pedestrian trajectory predictor, and it outperforms existing numerical predictors. Extensive experiments on public pedestrian trajectory prediction benchmarks show that VLMTraj can successfully understand social relationships and accurately extrapolate multimodal futures.
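
The abstract describes converting historical trajectories into text prompts and tokenizing numbers so that integer and decimal parts are separated. The abstract does not give the actual prompt template or tokenizer, so the following is only a minimal Python sketch of the general idea. The function names (`trajectory_to_prompt`, `split_number_tokens`, `parse_answer`), the prompt wording, and the two-decimal formatting are all assumptions made for illustration and are not taken from VLMTraj.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float]

# Matches signed decimals such as "-1.25" and plain integers such as "3".
NUMBER = re.compile(r"-?\d+\.\d+|-?\d+")


def trajectory_to_prompt(history: List[Point], horizon: int, caption: str = "") -> str:
    """Render observed (x, y) positions as a question (hypothetical wording)."""
    coords = ", ".join(f"({x:.2f}, {y:.2f})" for x, y in history)
    context = f"Scene: {caption} " if caption else ""
    return (
        f"{context}A pedestrian was observed at {coords}. "
        f"Predict the next {horizon} coordinates."
    )


def split_number_tokens(text: str) -> List[str]:
    """Split every number in the text into integer part, '.', and decimal part."""
    tokens: List[str] = []
    for match in NUMBER.finditer(text):
        number = match.group()
        if "." in number:
            integer, fraction = number.split(".")
            tokens.extend([integer, ".", fraction])
        else:
            tokens.append(number)
    return tokens


def parse_answer(answer: str) -> List[Point]:
    """Read consecutive number pairs in a text answer back into (x, y) points."""
    values = [float(v) for v in NUMBER.findall(answer)]
    return list(zip(values[0::2], values[1::2]))


if __name__ == "__main__":
    prompt = trajectory_to_prompt([(1.0, 2.5), (1.2, 2.75)], horizon=3)
    print(prompt)
    print(split_number_tokens("(1.20, 2.75)"))
    print(parse_answer("(1.40, 3.00), (1.60, 3.25)"))
```

In a pipeline of this kind, the prompt string would be passed to the language model together with the image tokens, and the text answer parsed back into coordinates. Decoding that answer with beam search would give a single most-likely path, while sampling with a temperature would give several diverse candidates, matching the deterministic and stochastic modes the abstract mentions.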
