Saad Asif Mohammed, Mahi Umme Niraj, Salim Md Shahidul, Hossain Sk Imran
Khulna University of Engineering & Technology, Khulna 9203, Bangladesh.
Data Brief. 2024 Aug 24;57:110874. doi: 10.1016/j.dib.2024.110874. eCollection 2024 Dec.
We present an updated standard Bangla dataset built from Bangla news articles. More than 1.9 million articles were collected from nine Bangla news websites, with selection guided by categories including sports, economy, politics, local news, technology, tourism, entertainment, education, health, the arts, and others. The attributes available vary by newspaper and include title, content, time, tags, meta, and category, among others. The dataset will enable data scientists to investigate and evaluate hypotheses in Bangla natural language processing. It is also likely to be used for domain-specific large language models in the context of Bangladesh, and it can support the development of deep learning and machine learning models that classify articles by subject.