Shandong Provincial University Laboratory for Protected Horticulture, Weifang University of Science and Technology, Weifang, China.
Sci Rep. 2024 Nov 22;14(1):28973. doi: 10.1038/s41598-024-80675-w.
Pepper diseases and pests typically present small targets, diverse shapes and sizes, complex imaging backgrounds, and strong visual similarity to those backgrounds. Existing detection methods perform poorly when identifying targets of different sizes and shapes within the same scene, and they lack adequate noise-suppression capability. To address the practical need to detect pepper diseases and pests in complex scenarios, we have constructed the first multimodal pepper disease and pest object detection dataset (PDD). The dataset includes a wide variety of disease and pest images, together with detailed natural language descriptions of their attributes. Locating the described targets in complex scenes with similar disease symptoms and leaf occlusion is a significant challenge. To tackle this issue, we propose PepperNet, a model for object detection in pepper disease and pest images guided by natural language descriptions. The model decomposes the complex multimodal features of language and images into explicit attribute features and employs a fine-grained multimodal attribute contrastive learning strategy. This approach effectively distinguishes subtle local differences between similar objects, achieving fine-grained language-to-vision mapping in complex scenarios. Our detection results show a mAP@0.5 of 91.93% at a detection speed of 121.8 frames per second. Visualizations indicate that the model remains highly robust under varying noise levels and occlusion conditions, demonstrating superior performance and stability across diverse complex scenarios.
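The fine-grained multimodal attribute contrastive learning mentioned above can be illustrated as a symmetric InfoNCE-style objective over paired visual and textual attribute embeddings. This is a generic sketch, not the authors' exact formulation: the function name, the temperature value, and the assumption that rows of the two matrices are matched attribute pairs are all hypothetical.

```python
import numpy as np

def attribute_contrastive_loss(vis, txt, tau=0.07):
    """Symmetric InfoNCE loss between visual and textual attribute
    embeddings. Rows of `vis` and `txt` are assumed to be matched
    pairs (a hypothetical setup; PepperNet's exact loss may differ).
    """
    # L2-normalise so the dot product is cosine similarity
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = vis @ txt.T / tau          # pairwise similarities / temperature
    labels = np.arange(len(vis))        # i-th image matches i-th description

    def xent(lg):
        # cross-entropy with the matched pair as the positive class
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Pulling matched attribute pairs together while pushing apart mismatched ones is what lets such a loss separate visually similar diseases that differ only in a few described attributes.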