

DriveLLaVA: Human-Level Behavior Decisions via Vision Language Model.

Affiliations

College of Automotive Engineering, Jilin University, Changchun 130025, China.

Graduate School of Information and Science Technology, The University of Tokyo, Tokyo 113-8654, Japan.

Publication Information

Sensors (Basel). 2024 Jun 25;24(13):4113. doi: 10.3390/s24134113.

Abstract

Human-level driving is the ultimate goal of autonomous driving. As the top-level decision-making component of an autonomous driving system, behavior decision establishes short-term driving strategies by evaluating road structure, adhering to traffic rules, and analyzing the intentions of other traffic participants. Existing behavior decision methods are primarily rule-based and generalize poorly to new, unseen driving scenarios. In this paper, we propose a novel behavior decision method that leverages the inherent generalization and commonsense reasoning abilities of vision language models (VLMs) to learn and simulate the behavior decision process of human drivers. We constructed a novel instruction-following dataset containing a large number of image-text instructions paired with corresponding driving behavior labels, to support the training of the Drive Large Language and Vision Assistant (DriveLLaVA) and to enhance the transparency and interpretability of the entire decision process. DriveLLaVA is fine-tuned on this dataset using Low-Rank Adaptation (LoRA), which keeps the number of trainable parameters small and significantly reduces training cost. We conducted extensive experiments on a large-scale instruction-following dataset, and compared with state-of-the-art methods, DriveLLaVA demonstrated excellent behavior decision performance. DriveLLaVA handles a variety of complex driving scenarios, showing strong robustness and generalization.
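The two technical pieces named in the abstract, an image-text instruction dataset with driving behavior labels and LoRA-based fine-tuning, can be pictured with a minimal sketch. The record layout, base checkpoint (llava-hf/llava-1.5-7b-hf), target modules, and hyperparameters below are illustrative assumptions using the Hugging Face transformers and peft libraries, not the paper's actual configuration.

# Hypothetical sketch: one instruction-following record and a LoRA adapter setup
# for a LLaVA-style VLM. Field names, checkpoint, and hyperparameters are assumed.
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# One possible shape of an image-text instruction paired with a driving behavior label.
sample = {
    "image": "scene_000123.jpg",  # front-camera frame (hypothetical file)
    "instruction": "Describe the traffic scene and choose the next driving behavior.",
    "label": "decelerate and yield to the pedestrian crossing ahead",
}

# LoRA inserts small low-rank adapter matrices into selected projection layers,
# so only a small fraction of the parameters is updated during fine-tuning.
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # trainable parameters are a small share of the total

Under a setup like this, only the adapter weights are trained while the base VLM stays frozen, which is what allows the full model to be adapted to the behavior decision task at low cost.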


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c91/11243790/f287cc1186ba/sensors-24-04113-g001.jpg
