Hosen Sabbir, Eva Jannatul Ferdous, Hasib Ayman, Saha Aloke Kumar, Mridha M F, Wadud Anwar Hussen
Department of Computer Science and Engineering, University of Asia Pacific, Dhaka, Bangladesh.
Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh.
Data Brief. 2023 May 18;48:109245. doi: 10.1016/j.dib.2023.109245. eCollection 2023 Jun.
This data article presents a question-answering (QA) dataset for training chatbots and chat analysis models. The dataset targets NLP tasks in which a model must deliver a satisfactory response to a user's query. We built it from the well-known Ubuntu Dialogue Corpus, which consists of about one million multi-turn conversations containing around seven million utterances and one hundred million words. From these lengthy conversations we derived a context for each dialogueID, and then generated a number of questions and answers based on each context. All of the questions and answers are contained within their contexts. The dataset includes 9364 contexts and 36,438 question-answer pairs. In addition to academic research, the dataset may be used for activities such as constructing a comparable QA dataset for another language, deep learning, language interpretation, reading comprehension, and open-domain question answering. We present the data in raw format; it is open source and publicly available at https://data.mendeley.com/datasets/p85z3v45xk.
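The statement that every answer is contained within its context can be checked mechanically once the raw files are loaded. The following is a minimal sketch only: the abstract does not describe the file layout, so the record structure (`ContextRecord`, `QAPair`) and the field names are hypothetical, and the sample record is invented toy data, not taken from the dataset.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class QAPair:
    # Hypothetical field names; the real files may use different keys.
    question: str
    answer: str


@dataclass
class ContextRecord:
    dialogue_id: str       # assumed to correspond to the corpus dialogueID
    context: str           # text derived from one conversation
    qa_pairs: List[QAPair]


def answer_span(context: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first exact match, or None."""
    start = context.find(answer)
    if start == -1:
        return None
    return (start, start + len(answer))


def missing_answers(records: List[ContextRecord]) -> List[Tuple[str, str]]:
    """List (dialogue_id, question) pairs whose answer is not found in the context."""
    return [
        (rec.dialogue_id, qa.question)
        for rec in records
        for qa in rec.qa_pairs
        if answer_span(rec.context, qa.answer) is None
    ]


if __name__ == "__main__":
    # Invented toy record for illustration only.
    toy = ContextRecord(
        dialogue_id="toy-1",
        context="Run sudo apt-get update to refresh the package list.",
        qa_pairs=[
            QAPair("Which command refreshes the package list?", "sudo apt-get update"),
        ],
    )
    print(missing_answers([toy]))
```

An empty result from `missing_answers` means every answer was found verbatim in its context. Exact string matching is a deliberately strict choice here; if the raw files differ from the contexts in whitespace or casing, a normalization step would be needed first.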