Bae Inhwan, Lee Junoh, Jeon Hae-Gon
IEEE Trans Pattern Anal Mach Intell. 2025 Jun 20;PP. doi: 10.1109/TPAMI.2025.3582000.
Recent advancements in language models have demonstrated their capacity for context understanding and generative representation. Building on these developments, we propose a novel multimodal trajectory predictor based on a vision-language model, named VLMTraj, which fully exploits the prior knowledge of multimodal large language models and their human-like reasoning across diverse modalities. The key idea of our model is to reframe the trajectory prediction task as a visual question answering format, using historical information as context and instructing the language model to make predictions in a conversational manner. Specifically, we transform all inputs into a natural-language style: historical trajectories are converted into text prompts, and scene images are described through image captioning. In addition, visual features from the input images are transformed into tokens via a modality encoder and connector. The transformed data is then formatted for use in a language model. Next, to guide the language model in understanding and reasoning about high-level knowledge, such as scene context and social relationships between pedestrians, we introduce auxiliary multi-task questions and answers. For training, we first optimize a numerical tokenizer on the prompt data to effectively separate integer and decimal parts, allowing the language model to capture correlations between consecutive numbers. We then train our language model on all the visual question answering prompts. At inference time, we implement both deterministic and stochastic prediction through beam-search-based most-likely prediction and temperature-based multimodal generation. Our VLMTraj validates that a language-based model can be a powerful pedestrian trajectory predictor, and it outperforms existing numerical-based prediction methods.
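To make the prompt-conversion and numerical-tokenization steps concrete, here is a minimal sketch under assumed conventions (the function names, prompt wording, and two-decimal coordinate format are illustrative, not the paper's actual implementation): a trajectory is serialized into a natural-language question, and each coordinate is split into sign, integer part, and individual decimal digits so the tokenizer separates integer and decimal parts.

```python
def trajectory_to_prompt(track):
    # track: list of (x, y) positions in metres, oldest first.
    # Hypothetical prompt template; the paper's exact wording may differ.
    steps = ", ".join(f"({x:.2f}, {y:.2f})" for x, y in track)
    return (f"The pedestrian's past positions were {steps}. "
            "Where will the pedestrian move next?")

def tokenize_number(value):
    # Split a coordinate into sign, integer part, decimal point,
    # and per-digit decimal tokens, so integer and decimal parts
    # are represented separately in the vocabulary.
    text = f"{value:.2f}"
    sign = "-" if text.startswith("-") else "+"
    integer, decimal = text.lstrip("-").split(".")
    return [sign, integer, "."] + list(decimal)

prompt = trajectory_to_prompt([(1.0, 2.5), (1.2, 2.75)])
tokens = tokenize_number(-3.14)  # ['-', '3', '.', '1', '4']
```

Separating digits this way lets the model learn correlations between consecutive numeric tokens instead of treating each full number as an opaque subword.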
Extensive experiments show that VLMTraj successfully understands social relationships and accurately extrapolates multimodal futures on public pedestrian trajectory prediction benchmarks.
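The deterministic-versus-stochastic inference described above can be sketched with a toy next-token sampler (assumed setup, not the paper's decoder): greedy/argmax decoding stands in for the most-likely prediction, while temperature-scaled softmax sampling yields diverse multimodal outputs.

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    # Temperature -> 0 approaches greedy (most-likely) decoding;
    # higher temperature flattens the distribution for diverse samples.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]
greedy = max(range(len(logits)), key=lambda i: logits[i])   # deterministic pick
samples = {sample_next_token(logits, temperature=1.5) for _ in range(200)}
```

Repeating the stochastic sampler produces multiple candidate futures, mirroring temperature-based multimodal generation; the deterministic path in the paper additionally uses beam search rather than plain argmax.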