Lu Qiwen, Chen Shengbo, Zhu Xiaoke
School of Computer and Information Engineering, Henan University, Kaifeng 475001, China.
J Imaging. 2024 Feb 23;10(3):56. doi: 10.3390/jimaging10030056.
Language bias is a well-known problem in visual question answering (VQA): models tend to rely on spurious correlations between questions and answers when predicting, which prevents them from generalizing effectively and degrades performance. To address this bias, we propose a novel modality fusion collaborative de-biasing algorithm (CoD). In our approach, bias is viewed as the model neglecting information from a particular modality during prediction. We employ collaborative training to facilitate mutual modeling between modalities, achieving efficient feature fusion and enabling the model to fully exploit multimodal knowledge for prediction. Experiments on the VQA-CP v2, VQA v2, and VQA-VS datasets, under different validation strategies, demonstrate the effectiveness of our approach. Notably, with a basic baseline model, our method reaches an accuracy of 60.14% on VQA-CP v2.
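The abstract does not spell out the algorithmic details of CoD, but the core idea it states (treating bias as the model neglecting one modality, and training branches collaboratively so that the fused prediction draws on both) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption made for illustration: the module names, feature dimensions, the multiplicative fusion, and the multi-task objective (fused branch plus auxiliary question-only and image-only branches) are hypothetical and are not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCollaborativeVQA(nn.Module):
    """Illustrative sketch: a fused prediction head plus question-only and
    image-only heads, so that each modality is explicitly modeled and a
    prediction cannot silently ignore one of them. Not the CoD architecture."""
    def __init__(self, q_dim=300, v_dim=2048, hid=512, n_answers=3000):
        super().__init__()
        self.q_enc = nn.Sequential(nn.Linear(q_dim, hid), nn.ReLU())
        self.v_enc = nn.Sequential(nn.Linear(v_dim, hid), nn.ReLU())
        self.fused_head = nn.Linear(hid, n_answers)  # uses both modalities
        self.q_head = nn.Linear(hid, n_answers)      # question-only branch
        self.v_head = nn.Linear(hid, n_answers)      # image-only branch

    def forward(self, q_feat, v_feat):
        q = self.q_enc(q_feat)
        v = self.v_enc(v_feat)
        fused = q * v  # simple element-wise (multiplicative) fusion
        return self.fused_head(fused), self.q_head(q), self.v_head(v)

def collaborative_loss(logits_fused, logits_q, logits_v, target, alpha=0.5):
    """Hypothetical joint objective: the fused branch is supervised on the
    answer labels, and the unimodal branches are trained as auxiliary
    predictors so both modalities are modeled during training."""
    loss_fused = F.cross_entropy(logits_fused, target)
    loss_q = F.cross_entropy(logits_q, target)
    loss_v = F.cross_entropy(logits_v, target)
    return loss_fused + alpha * (loss_q + loss_v)

# Minimal usage example with random features and labels.
model = ToyCollaborativeVQA()
q_feat = torch.randn(8, 300)   # e.g. averaged question embeddings
v_feat = torch.randn(8, 2048)  # e.g. pooled image features
target = torch.randint(0, 3000, (8,))
loss = collaborative_loss(*model(q_feat, v_feat), target)
loss.backward()
```

At test time, only the fused head would be used; the auxiliary branches serve during training to make each modality's contribution explicit. Again, this is a generic multi-branch sketch of the stated idea, not the paper's reported method.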