Prompt-guided and multimodal landscape scenicness assessments with vision-language models.

Affiliations

Laboratory of Geo-Information Science and Remote Sensing, Wageningen University, Wageningen, the Netherlands.

Instituut voor Milieuvraagstukken, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands.

Publication information

PLoS One. 2024 Sep 30;19(9):e0307083. doi: 10.1371/journal.pone.0307083. eCollection 2024.

DOI:10.1371/journal.pone.0307083

PMID:39348404

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11441650/

Abstract

Recent advances in deep learning and Vision-Language Models (VLMs) have enabled efficient transfer to downstream tasks even when limited labelled training data is available, and have made it possible to compare text directly with image content. These properties of VLMs enable new opportunities for the annotation and analysis of images. We test the potential of VLMs for predicting landscape scenicness, i.e., the aesthetic quality of a landscape, using zero- and few-shot methods. We experiment with few-shot learning by fine-tuning a single linear layer on a pre-trained VLM representation. We find that a model fitted to just a few hundred samples performs favourably compared to a model trained on hundreds of thousands of examples in a fully supervised way. We also explore the zero-shot prediction potential of contrastive prompting using positive and negative landscape aesthetic concepts. Our results show that this method outperforms a linear probe with few-shot learning when using a small number of samples to tune the prompt configuration. We introduce Landscape Prompt Ensembling (LPE), which is an annotation method for acquiring landscape scenicness ratings through rated text descriptions without needing an image dataset during annotation. We demonstrate that LPE can provide landscape scenicness assessments that are concordant with a dataset of image ratings. The success of zero- and few-shot methods combined with their ability to use text-based annotations highlights the potential for VLMs to provide efficient landscape scenicness assessments with greater flexibility.
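As an illustration only, the two text-driven ideas in the abstract (contrastive prompting and prompt ensembling over rated descriptions) can be sketched on top of precomputed embeddings. Everything below is an assumption rather than the authors' implementation: the function names `contrastive_score` and `lpe_score`, the temperature value, the softmax weighting, and the three-dimensional toy vectors, which stand in for real VLM image and text embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_score(image_emb, pos_embs, neg_embs, temperature=0.01):
    """Hypothetical zero-shot score in (0, 1).

    Compares the mean similarity to positive prompts against the mean
    similarity to negative prompts with a two-way softmax, which is
    equivalent to a sigmoid of the scaled difference.
    The temperature is an assumed value, not taken from the paper.
    """
    pos = sum(cosine(image_emb, p) for p in pos_embs) / len(pos_embs)
    neg = sum(cosine(image_emb, n) for n in neg_embs) / len(neg_embs)
    return 1.0 / (1.0 + math.exp(-(pos - neg) / temperature))


def lpe_score(image_emb, rated_prompts, temperature=0.01):
    """Hypothetical prompt-ensemble score.

    rated_prompts is a list of (text_embedding, rating) pairs. The result
    is a softmax-weighted average of the ratings, weighted by how similar
    each description is to the image. The weighting scheme is assumed.
    """
    sims = [cosine(image_emb, emb) for emb, _ in rated_prompts]
    top = max(sims)  # subtract the maximum for numerical stability
    weights = [math.exp((s - top) / temperature) for s in sims]
    total = sum(weights)
    return sum(w * r for w, (_, r) in zip(weights, rated_prompts)) / total


# Toy vectors standing in for embeddings of an image and two descriptions.
scenic_text = [1.0, 0.0, 0.0]      # e.g. a positively rated description
unscenic_text = [0.0, 1.0, 0.0]    # e.g. a negatively rated description
image = [0.9, 0.1, 0.0]            # image closer to the scenic description

zero_shot = contrastive_score(image, [scenic_text], [unscenic_text])
ensemble = lpe_score(image, [(scenic_text, 9.0), (unscenic_text, 2.0)])
```

In this sketch no rated images are needed at annotation time: only the text descriptions carry ratings, and the image inherits a score from the descriptions it most resembles in embedding space.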
