BTSD: A curated transformation of sentence dataset for text classification in Bangla language.

Authors

Das Rajesh Kumar, Islam Mirajul, Khushbu Sharun Akter

Affiliation

Department of Computer Science and Engineering, Daffodil International University, Dhaka 1341, Bangladesh.

Publication information

Data Brief. 2023 Jul 24;50:109445. doi: 10.1016/j.dib.2023.109445. eCollection 2023 Oct.

Abstract

The Bangla Transformation of Sentence Classification dataset addresses the resource gap in natural language processing (NLP) for the Bangla language by providing a curated resource for Bangla sentence classification. With 3,793 annotated sentences, the dataset focuses on categorizing Bangla sentences into Simple, Complex, and Compound classes. It serves as a benchmark for evaluating NLP models on Bangla sentence classification, promoting linguistic diversity and inclusive language models. Collected from publicly accessible Facebook pages, the dataset ensures balanced representation across the categories. Preprocessing steps, including anonymization and duplicate removal, were applied. Three native Bangla speakers independently assessed the Transformation of Sentence labels, enhancing the dataset's reliability. The dataset empowers researchers, practitioners, and developers to build accurate and robust NLP models tailored to the Bangla language. It offers insights into Bangla syntax and structure, benefiting linguistic research. The dataset can be used to train models, uncover patterns in Bangla language usage, and develop effective NLP applications across domains.
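The preprocessing and annotation steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the abstract does not say what was anonymized or how the three annotators' judgments were combined, so the masking patterns (URLs and @-mentions), the NFC-based exact-duplicate check, and the two-of-three majority vote are all assumptions, and every function name here is hypothetical.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical anonymization patterns; the source does not state what was
# masked, so URLs and @-mentions are an assumption made for illustration.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def anonymize(sentence):
    """Mask URLs and @-mentions, then collapse runs of whitespace."""
    sentence = URL_RE.sub("<URL>", sentence)
    sentence = MENTION_RE.sub("<USER>", sentence)
    return " ".join(sentence.split())


def deduplicate(sentences):
    """Drop exact duplicates (compared after Unicode NFC normalization),
    keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for s in sentences:
        key = unicodedata.normalize("NFC", s)
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique


def majority_label(votes):
    """Return the label chosen by at least two annotators, or None when
    all annotators disagree (assumed aggregation rule, not from the source)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None
```

For example, `majority_label(["Simple", "Simple", "Complex"])` yields `"Simple"`, while three different votes yield `None`, which a curator could treat as a signal to re-review the sentence.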

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b10f/10415831/e16707d67196/gr1.jpg
