Pan Qingtao, Li Zhengrong, Qiao Wenhao, Lou Jingjiao, Yang Qing, Yang Guang, Ji Bing
IEEE Trans Med Imaging. 2025 May 23;PP. doi: 10.1109/TMI.2025.3573018.
Low-quality pseudo-labels pose a significant obstacle in semi-supervised medical image segmentation (SSMIS), impeding consistency learning on unlabeled data. Leveraging vision-language models (VLMs) holds promise for improving pseudo-label quality by employing textual prompts to delineate segmentation regions, but it faces the challenge of cross-modal alignment uncertainty due to multiple correspondences (multiple images/texts tend to correspond to one text/image). Existing VLMs address this challenge by modeling semantics as distributions, but such distributions lead to semantic degradation. To address these problems, we propose the Alignment-Multiplicity Aware Vision-Language Model (AMVLM), a new VLM pre-training paradigm with two novel similarity-metric strategies. (i) Cross-modal Similarity Supervision (CSS) introduces a probability distribution transformer to supervise similarity scores across fine-granularity semantics by measuring cross-modal distribution disparities, thus learning cross-modal multiple alignments. (ii) Intra-modal Contrastive Learning (ICL) takes into account the similarity metric of coarse-fine granularity information within each modality to encourage cross-modal semantic consistency. Furthermore, using the pretrained AMVLM, we propose a pioneering text-guided SSMIS network to compensate for the quality deficiencies of pseudo-labels. This network incorporates a text mask generator to produce multimodal supervision information, enhancing pseudo-label quality and the model's consistency learning. Extensive experiments validate the efficacy of our AMVLM-driven SSMIS, showing superior performance across four publicly available datasets. The code will be available at: https://github.com/QingtaoPan/AMVLM.
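The abstract describes supervising cross-modal similarity scores by measuring distribution disparities between image-to-text and text-to-image similarity distributions. The following is a minimal illustrative sketch of that general idea, not the authors' actual CSS/ICL implementation: a CLIP-style symmetric contrastive loss plus a symmetric-KL disparity term between the two cross-modal similarity distributions. All function names and the temperature value are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_losses(img_emb, txt_emb, temperature=0.07):
    """Toy stand-in for distribution-based cross-modal supervision.

    Returns a CLIP-style contrastive loss and a KL-based disparity
    between image->text and text->image similarity distributions.
    This is NOT the paper's CSS module, only a generic sketch.
    """
    # L2-normalize embeddings so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix

    p_i2t = softmax(logits, axis=1)             # image -> text distribution
    p_t2i = softmax(logits.T, axis=1)           # text -> image distribution

    n = img.shape[0]
    idx = np.arange(n)
    # symmetric InfoNCE: matched pairs lie on the diagonal
    contrastive = -(np.log(p_i2t[idx, idx]) +
                    np.log(p_t2i[idx, idx])).mean() / 2.0

    # disparity between the two cross-modal distributions (KL divergence);
    # a distribution-matching term of this kind could supervise similarity
    # scores, in the spirit of (but not identical to) the paper's CSS
    eps = 1e-9
    kl = (p_i2t * (np.log(p_i2t + eps) - np.log(p_t2i + eps))).sum(axis=1).mean()
    return contrastive, kl
```

With perfectly matched, symmetric embeddings the two similarity distributions coincide and the disparity term vanishes, which is the behavior such a supervision signal is designed to encourage.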