Ma Jie, Chai Qi, Huang Jingyue, Liu Jun, You Yang, Zheng Qinghua
IEEE Trans Image Process. 2022;31:7378-7388. doi: 10.1109/TIP.2022.3180563. Epub 2022 Dec 1.
Textbook Question Answering (TQA) is the task of answering diagram and non-diagram questions given a large multi-modal context of abundant text and diagrams. Because of this specificity, deep text understanding and effective learning of diagram semantics are essential for the task. In this paper, we propose a Weakly Supervised learning method for TQA (WSTQ), which treats the imperfect results of essential intermediate procedures as weak supervision to construct Text Matching (TM) and Relation Detection (RD) tasks, and then uses these tasks to drive itself to learn strong text comprehension and rich diagram semantics, respectively. Specifically, we use the results of text retrieval to build positive and negative text pairs. To learn deep text understanding, we first pre-train the text understanding module of WSTQ on TM and then fine-tune it on TQA. We build positive and negative relation pairs by checking whether the items/regions detected in diagrams by object detection overlap. The RD task forces our method to learn the relationships between regions, which are crucial for expressing diagram semantics. We train WSTQ on RD and TQA simultaneously, i.e., via multitask learning, to obtain effective diagram semantics and thereby improve TQA performance. Extensive experiments on CK12-QA and AI2D verify the effectiveness of WSTQ. The results show that our method achieves significant accuracy improvements of 5.02% and 4.12% on the test splits of these datasets, respectively, over the current state-of-the-art baseline. Our code is available at https://github.com/dr-majie/WSTQ.
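The RD supervision described in the abstract can be sketched as follows. This is a minimal illustration, assuming axis-aligned bounding boxes in `(x1, y1, x2, y2)` format; the function names and pair-labeling convention are assumptions for exposition, not the authors' actual implementation:

```python
# Hypothetical sketch: build weak RD labels from detected diagram regions.
# Two regions whose bounding boxes overlap form a positive pair (label 1);
# disjoint regions form a negative pair (label 0).

def boxes_overlap(a, b):
    """Return True if axis-aligned boxes a and b share any area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def build_relation_pairs(boxes):
    """Label every region pair: 1 if the boxes overlap, else 0."""
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            label = 1 if boxes_overlap(boxes[i], boxes[j]) else 0
            pairs.append(((i, j), label))
    return pairs

# Example: boxes 0 and 1 overlap; box 2 is disjoint from both.
boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
print(build_relation_pairs(boxes))
# → [((0, 1), 1), ((0, 2), 0), ((1, 2), 0)]
```

Because these labels come from object-detection output, they are only approximately correct, which is precisely what makes RD a weakly supervised auxiliary task rather than a fully labeled one.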