Few-Shot Image Classification of Crop Diseases Based on Vision-Language Models.

Affiliations

School of Information Engineering, China University of Geosciences, Beijing 100083, China.

State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China.

Publication information

Sensors (Basel). 2024 Sep 21;24(18):6109. doi: 10.3390/s24186109.

DOI:10.3390/s24186109

PMID:39338855

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11435512/

Abstract

Accurate crop disease classification is crucial for ensuring food security and enhancing agricultural productivity. However, existing crop disease classification algorithms primarily focus on a single image modality and typically require a large number of samples. Our research addresses these issues by using pre-trained Vision-Language Models (VLMs), which exploit multimodal synergy to classify crop diseases better than traditional unimodal approaches. First, we apply the multimodal model Qwen-VL to generate detailed textual descriptions for representative disease images selected from the training set through clustering; these descriptions serve as prompt text for generating classifier weights. Compared with using a language model alone for prompt text generation, this approach better captures and conveys fine-grained, image-specific information, thereby improving prompt quality. Second, we integrate cross-attention and SE (Squeeze-and-Excitation) attention into the training-free mode VLCD (Vision-Language model for Crop Disease classification) and the training-required mode VLCD-T (VLCD-Training), respectively, for prompt text processing, enhancing the classifier weights by emphasizing the key text features. The experimental results show that our method improves classification effectiveness in few-shot crop disease scenarios, addressing data limitations and the difficulty of recognizing intricate diseases. It offers a practical tool for agricultural pathology and strengthens smart farming surveillance infrastructure.
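The two data-flow steps named in the abstract, choosing representative images by clustering and turning prompt text into classifier weights, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`select_representatives`, `build_classifier_weights`, `classify`) are hypothetical, the feature arrays stand in for embeddings that a pre-trained VLM's image and text encoders would produce, plain k-means is an assumed clustering choice, and averaging the prompt embeddings per class is an assumed CLIP-style simplification. The cross-attention and SE attention stages of VLCD and VLCD-T are not modeled here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def select_representatives(features, k, n_iter=20, seed=0):
    """Run plain k-means (an assumed choice) and return, for each cluster,
    the index of the sample closest to that cluster's centroid."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct samples (fancy indexing copies).
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Pairwise distances, shape (n_samples, k).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    # Nearest sample per centroid; duplicates are merged.
    return sorted(set(dists.argmin(axis=0).tolist()))


def build_classifier_weights(prompt_embeddings):
    """Build one unit-length weight vector per class.

    prompt_embeddings: list with one array of shape (n_prompts, dim) per class,
    standing in for text-encoder embeddings of the generated descriptions.
    Averaging the prompts is an assumption, not the paper's attention scheme.
    """
    class_vectors = [
        l2_normalize(np.asarray(e, dtype=float)).mean(axis=0)
        for e in prompt_embeddings
    ]
    return l2_normalize(np.stack(class_vectors))


def classify(image_features, weights):
    """Assign each image to the class whose weight vector is most cosine-similar."""
    sims = l2_normalize(np.asarray(image_features, dtype=float)) @ weights.T
    return sims.argmax(axis=1)
```

In use, `select_representatives` would be applied to the image embeddings of each disease class, the selected images would be described by the multimodal model, and the text embeddings of those descriptions would be passed to `build_classifier_weights` before calling `classify` on query images.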
