Das Rajesh Kumar, Islam Mirajul, Khushbu Sharun Akter
Department of Computer Science and Engineering, Daffodil International University, Dhaka 1341, Bangladesh.
Data Brief. 2023 Jul 24;50:109445. doi: 10.1016/j.dib.2023.109445. eCollection 2023 Oct.
The Bangla Transformation of Sentence Classification dataset addresses the resource gap in natural language processing (NLP) for the Bangla language by providing a curated resource for Bangla sentence classification. With 3,793 annotated sentences, the dataset focuses on categorizing Bangla sentences into Simple, Complex, and Compound classes. It serves as a benchmark for evaluating NLP models on Bangla sentence classification, promoting linguistic diversity and inclusive language models. Collected from publicly accessible Facebook pages, the dataset ensures balanced representation across the categories. Preprocessing steps, including anonymization and duplicate removal, were applied. Three native Bangla speakers independently assessed the Transformation of Sentence labels, enhancing the dataset's reliability. The dataset empowers researchers, practitioners, and developers to build accurate and robust NLP models tailored to the Bangla language. It offers insights into Bangla syntax and structure, benefiting linguistic research. The dataset can be used to train models, uncover patterns in Bangla language usage, and develop effective NLP applications across domains.