Cohen Dror, Rosenberger Ido, Butman Moshe, Bar Kfir
Faculty of Computer Science, The College of Management Academic Studies, Rishon LeZion, Israel.
Front Artif Intell. 2023 Mar 22;6:1091443. doi: 10.3389/frai.2023.1091443. eCollection 2023.
Deep neural networks have proven effective at classifying human interactions into emotions, especially when they encode multiple input modalities. In this work, we assess the robustness of a transformer-based multimodal audio-text classifier for emotion recognition by perturbing the input at inference time with attacks designed specifically to corrupt information deemed important for emotion recognition. To measure the impact of the attacks on the classifier, we compare the accuracy of the classifier on the perturbed input with its accuracy on the original, unperturbed input. Our results show that the multimodal classifier is more resilient to perturbation attacks than the equivalent unimodal classifiers, suggesting that the two modalities are encoded in a way that allows the classifier to benefit from one modality even when the other is slightly damaged.
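The evaluation protocol the abstract describes (perturb each modality at inference time, then compare accuracy on perturbed versus clean input) can be illustrated with a minimal sketch. Everything below is hypothetical scaffolding, not the authors' implementation: the toy model, the dataset, and the perturbation functions perturb_audio and perturb_text are stand-ins for the transformer-based classifier and the emotion-targeted attacks the paper actually uses.

```python
# Minimal sketch of the perturb-and-compare robustness protocol.
# All names here are hypothetical stand-ins for the paper's components.
import random

random.seed(0)

def perturb_text(text: str, drop_prob: float = 0.2) -> str:
    """Corrupt the text modality by randomly dropping words
    (a stand-in for attacks targeting emotion-salient tokens)."""
    words = text.split()
    kept = [w for w in words if random.random() > drop_prob]
    return " ".join(kept) if kept else words[0]

def perturb_audio(samples, noise_scale: float = 0.05):
    """Corrupt the audio modality with additive Gaussian noise."""
    return [s + random.gauss(0.0, noise_scale) for s in samples]

def accuracy(model, dataset) -> float:
    """Fraction of examples the model labels correctly."""
    correct = sum(model(audio, text) == label
                  for (audio, text), label in dataset)
    return correct / len(dataset)

# Toy stand-in classifier; in the paper this would be the
# transformer-based multimodal audio-text model.
def toy_model(audio, text):
    return "positive" if "happy" in text else "negative"

dataset = [
    (([0.1, -0.2, 0.05], "i am so happy today"), "positive"),
    (([0.3, 0.0, -0.1], "this is terrible news"), "negative"),
]

clean_acc = accuracy(toy_model, dataset)
perturbed = [((perturb_audio(a), perturb_text(t)), y)
             for (a, t), y in dataset]
attacked_acc = accuracy(toy_model, perturbed)
print(f"clean accuracy: {clean_acc:.2f}, "
      f"perturbed accuracy: {attacked_acc:.2f}")
```

The gap between the two accuracies is the robustness measure: a multimodal classifier that degrades little under single-modality perturbation is, in the paper's terms, benefiting from the intact modality.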