

HQA-Data: A historical question answer generation dataset from previous multi perspective conversation.

Author information

Hosen Sabbir, Eva Jannatul Ferdous, Hasib Ayman, Saha Aloke Kumar, Mridha M F, Wadud Anwar Hussen

Affiliations

Department of Computer Science and Engineering, University of Asia Pacific, Dhaka, Bangladesh.

Department of Computer Science, American International University-Bangladesh, Dhaka, Bangladesh.

Publication information

Data Brief. 2023 May 18;48:109245. doi: 10.1016/j.dib.2023.109245. eCollection 2023 Jun.

Abstract

This data article presents a question-answering (QA) dataset for training chatbots and chat-analysis models. The dataset targets NLP tasks in which a model must deliver a satisfactory response to a user's query. To construct it, we drew on the well-known Ubuntu Dialogue Corpus, which consists of about one million multi-turn conversations containing around seven million utterances and one hundred million words. From these lengthy conversations we derived a context for each dialogueID, then generated a number of questions and answers based on each context. Every question and answer is contained within its context. The dataset includes 9,364 contexts and 36,438 question-answer pairs. In addition to academic research, the dataset may be used for activities such as constructing a comparable QA dataset in another language, deep learning, language interpretation, reading comprehension, and open-domain question answering. We present the data in raw format; it is open source and publicly available at https://data.mendeley.com/datasets/p85z3v45xk.
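The abstract states that every answer is contained within its context and gives the overall counts. The following is a minimal Python sketch of how those two properties might be checked. The record layout (`dialogue_id`, `context`, `qa_pairs`) and the sample text are assumptions made for illustration; the abstract does not specify the file format or field names used in the Mendeley release.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class ContextRecord:
    # Hypothetical field names; the real files may be organised differently.
    dialogue_id: str
    context: str
    qa_pairs: List[QAPair]


def answers_outside_context(record: ContextRecord) -> List[QAPair]:
    """Return the QA pairs whose answer does not occur verbatim in the context."""
    return [qa for qa in record.qa_pairs if qa.answer not in record.context]


def mean_pairs_per_context(num_pairs: int, num_contexts: int) -> float:
    """Average number of question-answer pairs per context."""
    return num_pairs / num_contexts


# Invented sample record, not taken from the dataset.
sample = ContextRecord(
    dialogue_id="example-1",
    context="You can install the player with sudo apt-get install vlc.",
    qa_pairs=[
        QAPair("How can the player be installed?", "sudo apt-get install vlc"),
        QAPair("Which desktop is used?", "GNOME"),
    ],
)

# Only the second pair fails the containment check.
violations = answers_outside_context(sample)

# Counts reported in the abstract: 36,438 pairs over 9,364 contexts.
average = mean_pairs_per_context(36438, 9364)
```

With the reported counts, the average comes to just under four question-answer pairs per context.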



