Driessen Tom, Dodou Dimitra, Bazilinskyy Pavlo, de Winter Joost
Delft University of Technology, Delft, Zuid-Holland, The Netherlands.
Eindhoven University of Technology, Eindhoven, Noord-Brabant, The Netherlands.
R Soc Open Sci. 2024 May 29;11(5):231676. doi: 10.1098/rsos.231676. eCollection 2024 May.
Vision-language models are of interest in various domains, including automated driving, where computer vision techniques can accurately detect road users, but where the vehicle sometimes fails to understand context. This study examined the effectiveness of GPT-4V in predicting the level of 'risk' in traffic images as assessed by humans. We used 210 static images taken from a moving vehicle, each previously rated by approximately 650 people. Based on psychometric construct theory and using insights from the self-consistency prompting method, we formulated three hypotheses: (i) repeating the prompt under effectively identical conditions increases validity, (ii) varying the prompt text and extracting a total score increases validity compared to using a single prompt, and (iii) in a multiple regression analysis, the incorporation of object detection features, alongside the GPT-4V-based risk rating, significantly contributes to improving the model's validity. Validity was quantified by the correlation coefficient with human risk scores, across the 210 images. The results confirmed the three hypotheses. The eventual validity coefficient was r = 0.83, indicating that population-level human risk can be predicted using AI with a high degree of accuracy. The findings suggest that GPT-4V must be prompted in a way equivalent to how humans fill out a multi-item questionnaire.
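The logic behind the three hypotheses can be illustrated with a minimal sketch. It is not the authors' analysis code and uses no study data: all values below are synthetic, and the names `pearson_r`, `fit_predict_ols`, `model_ratings`, and `object_count` are hypothetical, chosen only for illustration. The sketch assumes that each model rating equals the human score plus independent noise, and that a single object detection feature carries some additional signal. Under those assumptions, averaging repeated ratings should raise the correlation with the human scores, and adding the extra feature in a multiple regression should not lower it.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def fit_predict_ols(features, target):
    """Fit ordinary least squares with an intercept; return in-sample predictions."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


# Synthetic stand-ins (assumptions, not data from the study).
rng = np.random.default_rng(0)
n_images, n_repeats = 210, 50
human_risk = rng.normal(size=n_images)
# Hypothetical object detection feature, loosely related to human risk.
object_count = 0.5 * human_risk + rng.normal(size=n_images)
# Each column is one repetition of the prompt: true score plus independent noise.
model_ratings = human_risk[:, None] + rng.normal(size=(n_images, n_repeats))

# Validity of a single rating versus the mean over repeated ratings.
r_single = pearson_r(model_ratings[:, 0], human_risk)
mean_rating = model_ratings.mean(axis=1)
r_mean = pearson_r(mean_rating, human_risk)

# Multiple regression: mean model rating plus the object detection feature.
predictors = np.column_stack([mean_rating, object_count])
r_combined = pearson_r(fit_predict_ols(predictors, human_risk), human_risk)

print(f"single rating:         r = {r_single:.2f}")
print(f"mean of {n_repeats} ratings:    r = {r_mean:.2f}")
print(f"mean + object feature: r = {r_combined:.2f}")
```

The regression here is evaluated in-sample, so its correlation can never fall below that of the mean rating alone; judging whether an added feature contributes significantly would require a significance test or held-out images.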