Cheng Yao, Luo Senlin, Wan Yunwei, Pan Limin, Li Xinshuai
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, PR China.
Neural Netw. 2025 Mar;183:106971. doi: 10.1016/j.neunet.2024.106971. Epub 2024 Nov 30.
In black-box scenarios, adversarial attacks against text classification models struggle to produce usable adversarial samples, and in particular they waste large numbers of invalid queries on long texts. Existing methods select distractors by comparing the confidence vectors obtained before and after deleting each word, so the query count grows linearly with text length, which makes them hard to apply in query-limited attack scenarios. Generating adversarial samples from a thesaurus can introduce semantic inconsistencies and even grammatical errors, making the adversarial samples easy for the target model to recognize and yielding a low attack success rate. A parallel and highly stealthy Adversarial Attack against Text Classification Model (AdATCM) is proposed, which reinforces the dual tasks of attack and generation. The method requires no queries to the target model during distractor selection; instead, it uses contextual information directly to compute word importance and selects distractors in a single pass, improving the stealth of the attack. KL divergence loss, cross-entropy loss, and adversarial loss are integrated into an objective function for training the adversarial-sample attack model, generating adversarial samples that fit the original sample distribution and raise the attack success rate. Experimental results show that the method achieves a high success rate and strong stealth, and effectively reduces the number of attack queries on long texts.
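As a concrete illustration of query-free, context-based importance scoring, the sketch below uses a pretrained masked language model to score every word from its context alone, with no calls to the target classifier. The surprisal-style scoring rule, the choice of bert-base-uncased, and the function name contextual_importance are illustrative assumptions, not the paper's exact procedure.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# Illustrative setup: a masked language model serves as the context scorer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def contextual_importance(sentence: str) -> list[tuple[str, float]]:
    """Score each token's importance from context alone (no target-model queries).

    Assumption: a token the context predicts poorly carries more information,
    so its masked-LM surprisal is used here as an importance proxy.
    """
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    scores = []
    for pos in range(1, len(ids) - 1):                 # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        surprisal = -log_probs[ids[pos]].item()        # high = important
        scores.append((tokenizer.convert_ids_to_tokens(int(ids[pos])), surprisal))
    # Rank all candidate distractor positions in one go, most important first.
    return sorted(scores, key=lambda t: t[1], reverse=True)
```

Because each position is scored independently of the others, the masked copies can be batched into a single forward pass, which is what lets distractor selection run in parallel and decouples target-model queries from text length.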
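The objective function combines the three losses named in the abstract. The following is a minimal sketch of one plausible combination, assuming a PyTorch setup; the function name adversarial_objective, the weights alpha, beta, and gamma, and the exact form of each term are assumptions rather than the paper's definition.

```python
import torch
import torch.nn.functional as F

def adversarial_objective(lm_logits, target_ids,      # generator logits vs. reference tokens
                          gen_log_probs, orig_probs,  # generated vs. original token distributions
                          victim_logits, true_label,  # victim output on the adversarial text
                          alpha=1.0, beta=1.0, gamma=1.0):
    # Cross-entropy loss: keeps the generated text fluent and well-formed.
    ce = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                         target_ids.view(-1))
    # KL divergence loss: pulls the adversarial sample's distribution toward
    # the original sample distribution so the perturbation stays inconspicuous.
    kl = F.kl_div(gen_log_probs, orig_probs, reduction="batchmean")
    # Adversarial loss: rewards pushing the victim classifier away from the
    # true label (negated cross-entropy, i.e., an untargeted attack term).
    adv = -F.cross_entropy(victim_logits, true_label)
    # alpha/beta/gamma are assumed hyperparameters, not taken from the paper.
    return alpha * ce + beta * kl + gamma * adv
```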