Aurpa Tanjim Taharat, Ahmed Md Shoaib, Rifat Richita Khandakar, Anwar Md Musfique, Shawkat Ali A B M
Department of Computer Science and Engineering, Jahangirnagar University, Savar, Dhaka, Bangladesh.
Department of Computer Science and Engineering, International University of Business Agriculture and Technology, Bangladesh.
Data Brief. 2023 Feb 2;47:108933. doi: 10.1016/j.dib.2023.108933. eCollection 2023 Apr.
Reading comprehension (RC) is attracting growing attention in Bangla natural language processing (NLP) research, in both machine learning and deep learning approaches. However, no original Bangla RC dataset drawn from diverse sources exists; the available ones are translations of foreign RC datasets and contain anomalies and mismatched translations. In this paper, we present UDDIPOK, a novel, wide-ranging, open-domain Bangla reading comprehension dataset. The dataset contains 270 reading passages and 3636 questions with their answers, drawn from diverse sources such as textbooks, middle and high school exam questions, and newspapers. It is formatted as a CSV file with three columns: passages, questions, and answers. As a result, the data can be loaded and processed quickly and easily for machine learning research.
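The three-column CSV layout described above lends itself to a very small loader. The following is a minimal sketch, not the authors' code: it assumes the header row uses the exact names `passages`, `questions`, and `answers`, and that each row holds one question-answer pair alongside its passage. The helper `load_rc_rows` and the sample rows are hypothetical placeholders, not data from UDDIPOK.

```python
import csv
import io
from collections import defaultdict


def load_rc_rows(handle):
    """Read (passage, question, answer) rows and group QA pairs by passage.

    Assumes a header row with the columns 'passages', 'questions', 'answers'
    (an assumption based on the abstract, not a verified file layout).
    """
    reader = csv.DictReader(handle)
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["passages"]].append((row["questions"], row["answers"]))
    return dict(grouped)


# Placeholder rows for illustration only; not taken from the dataset.
SAMPLE_CSV = (
    "passages,questions,answers\n"
    "passage A,question 1,answer 1\n"
    "passage A,question 2,answer 2\n"
    "passage B,question 3,answer 3\n"
)

grouped = load_rc_rows(io.StringIO(SAMPLE_CSV))
num_passages = len(grouped)
num_questions = sum(len(pairs) for pairs in grouped.values())
print(num_passages, num_questions)
```

For the real file, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the downloaded CSV is stored; UTF-8 is assumed here because the text is Bangla. Grouping by passage is one possible design choice that makes it easy to check the passage and question counts against those reported in the abstract.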