CTDUNet: A Multimodal CNN-Transformer Dual U-Shaped Network with Coordinate Space Attention for Pests and Diseases Segmentation in Complex Environments.

Authors

Guo Ruitian, Zhang Ruopeng, Zhou Hao, Xie Tunjun, Peng Yuting, Chen Xili, Yu Guo, Wan Fangying, Li Lin, Zhang Yongzhong, Liu Ruifeng

Affiliations

School of Electronic Information and Physics, Central South University of Forestry and Technology, Changsha 410004, China.

School of Business, Central South University of Forestry and Technology, Changsha 410004, China.

Publication information

Plants (Basel). 2024 Aug 15;13(16):2274. doi: 10.3390/plants13162274.

DOI:10.3390/plants13162274

PMID:39204710

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11359422/

Abstract

Camellia is a crop of high economic value, yet it is particularly susceptible to various diseases and pests that significantly reduce its yield and quality. Consequently, the precise segmentation and classification of diseased Camellia leaves are vital for managing pests and diseases effectively. Deep learning exhibits significant advantages in the segmentation of plant diseases and pests, particularly in complex image processing and automated feature extraction. However, when single-modal models are used to segment diseases, three critical challenges arise: (A) lesions may closely resemble the colors of the complex background; (B) small sections of diseased leaves may overlap; (C) multiple diseases may be present on a single leaf. These factors considerably hinder segmentation accuracy. A novel multimodal model, the CNN-Transformer Dual U-shaped Network (CTDUNet), based on a CNN-Transformer architecture, is proposed to integrate image and text information. The model first uses text data to compensate for the shortcomings of single-modal image features, enhancing its ability to distinguish lesions from environmental characteristics even when the two closely resemble one another. Additionally, we introduce Coordinate Space Attention (CSA), which focuses on the positional relationships between targets, thereby improving the segmentation of overlapping leaf edges. Furthermore, cross-attention (CA) is employed to align image and text features effectively, preserving local information and enhancing the perception and differentiation of various diseases. The CTDUNet model was evaluated on a self-built multimodal dataset and compared against several models, including DeepLabV3+, UNet, PSPNet, SegFormer, HRNet, and Language meets Vision Transformer (LViT). The experimental results demonstrate that CTDUNet achieved a mean Intersection over Union (mIoU) of 86.14%, surpassing the multimodal models and the best single-modal model by 3.91% and 5.84%, respectively. Additionally, CTDUNet performs evenly across classes in the multi-class segmentation of diseases and pests. These results indicate that fused image and text multimodal information can be applied successfully to the segmentation of Camellia disease, achieving outstanding performance.
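For readers unfamiliar with the headline metric, the following is a minimal sketch of how mean Intersection over Union is commonly computed for multi-class segmentation masks. It is illustrative only and is not taken from the paper: the function name `mean_iou`, the NumPy implementation, and the choice to average over every class that appears in either mask (including any background class) are assumptions, and the authors' exact evaluation protocol may differ.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in the prediction or the ground truth.

    Hypothetical helper for illustration; pred and target are integer label
    masks of the same shape with values in [0, num_classes).
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    cm = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(cm)                      # correctly labeled pixels per class
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection

    valid = union > 0                               # skip classes absent from both masks
    return float((intersection[valid] / union[valid]).mean())


# Toy 2x3 masks with three classes (0, 1, 2).
pred = [[0, 0, 1], [1, 2, 2]]
target = [[0, 1, 1], [1, 2, 0]]
print(mean_iou(pred, target, num_classes=3))
```

On the toy masks the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5. Averaging per class rather than per pixel is what makes the metric sensitive to how evenly a model performs across disease and pest classes.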
