Saikh Tanik, Ghosal Tirthankar, Mittal Amish, Ekbal Asif, Bhattacharyya Pushpak
Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihta, Patna, India.
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, 118 00 Praha, Czech Republic.
Int J Digit Libr. 2022;23(3):289-301. doi: 10.1007/s00799-022-00329-y. Epub 2022 Jul 20.
Machine Reading Comprehension (MRC) of a document is a challenging problem that requires discourse-level understanding. Information extraction from scholarly articles is now a critical use case, helping researchers quickly understand the underlying research and build on it, especially in this age of the infodemic. MRC on research articles can also provide helpful information to reviewers and editors. However, the main bottleneck in building such models is the availability of human-annotated data. In this paper, we first introduce a dataset to facilitate question answering (QA) on scientific articles. We prepare the dataset in a semi-automated fashion; it contains more than 100k human-annotated context-question-answer triples. Second, we implement a baseline QA model based on Bidirectional Encoder Representations from Transformers (BERT). We additionally implement two models: the first is based on Science BERT (SciBERT), and the second combines SciBERT with Bi-Directional Attention Flow (Bi-DAF). The best model (SciBERT) obtains an F1 score of 75.46%. Our dataset is novel, and our work opens a new avenue for scholarly document processing research by providing a benchmark QA dataset and a standard baseline. We make our dataset and code available at https://github.com/TanikSaikh/Scientific-Question-Answering.
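As a minimal illustration of the kind of pipeline the abstract describes (an extractive QA model over context-question-answer triples, evaluated with token-level F1), the sketch below loads the public SciBERT checkpoint with a span-prediction head via Hugging Face Transformers. The checkpoint name, example texts, and the F1 helper are assumptions for illustration only and are not the authors' released code; the span-prediction head here is untrained and would need fine-tuning on the released triples to produce meaningful answers.

```python
# Minimal sketch (not the authors' implementation): extractive QA with a SciBERT
# encoder plus a span-prediction head, and SQuAD-style token-level F1 scoring.
from collections import Counter

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

MODEL_NAME = "allenai/scibert_scivocab_uncased"  # public SciBERT checkpoint (assumed choice)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)  # QA head is untrained here
model.eval()

# Toy context-question pair standing in for one dataset triple.
context = (
    "Machine reading comprehension of a document is a challenging problem "
    "that requires discourse-level understanding."
)
question = "What does machine reading comprehension of a document require?"

inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely answer span from the start/end logits.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
predicted = tokenizer.decode(inputs["input_ids"][0][start : end + 1])


def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style F1: harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(predicted)
print(token_f1(predicted, "discourse-level understanding"))
```

After fine-tuning such a model on the dataset's triples, averaging `token_f1` over gold answers gives the kind of F1 figure reported in the abstract.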