Bangla news article dataset.

Author information

Saad Asif Mohammed, Mahi Umme Niraj, Salim Md Shahidul, Hossain Sk Imran

Affiliation

Khulna University of Engineering & Technology, Khulna 9203, Bangladesh.

Publication information

Data Brief. 2024 Aug 24;57:110874. doi: 10.1016/j.dib.2024.110874. eCollection 2024 Dec.

Abstract

We present an updated standard Bangla dataset built from Bangla news articles. In total, more than 1.9 million articles were gathered from nine Bangla news websites; selection was guided by a range of categories, including sports, economy, politics, local news, tech, tourism, entertainment, education, health, and the arts, among others. The attributes available vary by newspaper and include title, content, time, tags, meta, and category, among others. The dataset will enable data scientists to investigate and assess theories related to Bangla natural language processing. It may also be used to build domain-specific large language models for the Bangladeshi context, and to develop deep learning and machine learning models that classify articles by subject.
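
Because the attributes vary by newspaper, a consumer of the dataset would likely need to map records onto a common schema before training a classifier. The following is a minimal sketch under stated assumptions: the field names are taken from the abstract, the records are assumed to be plain dictionaries (for example, parsed from JSON), and the sample data and the helper names `normalize` and `category_counts` are hypothetical. The actual file format and per-newspaper keys are not specified here and may differ.

```python
from collections import Counter

# Attribute names taken from the abstract; the real per-newspaper keys may differ.
COMMON_FIELDS = ("title", "content", "time", "tags", "meta", "category")


def normalize(record):
    """Return a dict with every common field, using None where a source lacks it."""
    return {field: record.get(field) for field in COMMON_FIELDS}


def category_counts(records):
    """Count articles per category, grouping missing categories under 'unknown'."""
    return Counter((normalize(r)["category"] or "unknown") for r in records)


# Hypothetical sample records, not drawn from the dataset.
sample = [
    {"title": "A", "content": "...", "category": "sports"},
    {"title": "B", "content": "...", "time": "2024-01-01", "tags": ["x"]},
    {"title": "C", "content": "...", "category": "sports", "meta": {}},
]

counts = category_counts(sample)
print(counts)
```

Filling absent attributes with `None` keeps records from different newspapers comparable, and the per-category counts give a quick view of class balance before any topic-classification experiment.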

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d3a6/11404080/094a9f3ca499/gr1.jpg
