Al-Majmar Nashwan Ahmed, Gawbah Hezam, Alsubari Akram
Department of CS and IT, Faculty of Science, Ibb University, Yemen.
Department of Computers, Aljazeera University, Yemen.
Data Brief. 2024 Aug 22;56:110855. doi: 10.1016/j.dib.2024.110855. eCollection 2024 Oct.
With the soaring demand for healthcare systems, chatbots are gaining tremendous popularity and research attention, and language-centric research on healthcare is growing steadily. Despite significant advances in Arabic Natural Language Processing (NLP), natural language classification and generation remain challenging, primarily because suitable Arabic datasets for training models are lacking. To address this, the authors introduce the Arabic Healthcare Dataset (AHD), a large collection of textual data. The dataset consists of over 808k questions and answers across 90 categories and is offered to the research community for Arabic computational linguistics. The authors anticipate that this rich dataset will aid a variety of NLP tasks on Arabic textual data, especially text classification and generation. The data are presented in raw form. The main dataset in AHD was scraped from the medical website Altibbi. AHD is public and freely available at http://data.mendeley.com/datasets/mgj29ndgrk/5.