Sumikawa Yasunobu, Fujiyoshi Masaaki, Hatakeyama Hisashi, Nagai Masahiro
Tokyo Metropolitan University, Japan.
Data Brief. 2019 May 24;25:104001. doi: 10.1016/j.dib.2019.104001. eCollection 2019 Aug.
In this data article, we present an FAQ dataset written in Japanese, together with its English translation, for training chatbot models for e-learning systems. We first collected raw Q&A data on difficulties reported from April 2015 to July 2018 by users of the e-learning system introduced at Tokyo Metropolitan University. We then divided the data into 11 categories according to the features provided by the e-learning system. Finally, we merged questions that share the same answer to create the FAQ form. The dataset contains 427 questions and 79 answers, all examined by experts with more than three years of experience using the e-learning system. Using this dataset, we performed statistical analyses to evaluate the quality of the FAQ dataset. The proposed applications of the dataset include not only academic research but also other activities, for example, translating the dataset from Japanese into another language such as Chinese, adapting or modifying it for another e-learning system, and developing language models to obtain highly accurate responses from chatbots.
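The merging step described in the abstract (combining questions that share the same answer into one FAQ entry) can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `build_faq`, the `(category, question, answer)` tuple layout, the exact-match comparison on stripped answer text, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict


def build_faq(qa_pairs):
    """Group questions that share the same answer text into one FAQ entry.

    qa_pairs: iterable of (category, question, answer) tuples (assumed layout).
    Answers are compared by exact match after stripping whitespace, which is
    an assumption; the article does not state how equality was decided.
    """
    questions_by_answer = defaultdict(list)
    category_by_answer = {}
    for category, question, answer in qa_pairs:
        key = answer.strip()
        questions_by_answer[key].append(question.strip())
        # Keep the category of the first question seen for this answer.
        category_by_answer.setdefault(key, category)
    return [
        {"category": category_by_answer[a], "answer": a, "questions": qs}
        for a, qs in questions_by_answer.items()
    ]


# Hypothetical sample rows, not taken from the dataset.
raw = [
    ("login", "I cannot log in.", "Reset your password from the portal."),
    ("login", "My password is rejected.", "Reset your password from the portal."),
    ("quiz", "Where are my quiz results?", "Open the Grades page."),
]
faq = build_faq(raw)
```

With the three sample rows, `faq` holds two entries: the two login questions are merged under their shared answer, and the quiz question stays on its own. This mirrors how a set of many questions can reduce to a smaller set of answers.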