Bijoy Md Hasan Imam, Ayman Umme, Islam Md Monarul
Department of Computer Science and Engineering, Daffodil International University, Dhaka, 1216, Bangladesh.
Data Brief. 2025 Feb 19;59:111400. doi: 10.1016/j.dib.2025.111400. eCollection 2025 Apr.
Bengali (Bangla), an Indo-Aryan language, has a complex grammatical structure in which tense plays a central role, and handling tense correctly is crucial for natural language processing (NLP) applications such as text classification, machine translation, and sentiment analysis. The BanglaTense dataset is a large-scale, meticulously curated collection of Bangla sentences categorized by tense: past, present, and future. Addressing the resource gap in Bangla NLP, it contains 17,819 annotated sentences: 5,629 in the past tense, 6,101 in the present tense, and 6,089 in the future tense. The dataset serves as a benchmark for evaluating NLP models on Bangla sentence classification, promoting linguistic diversity and inclusive language models while keeping the three categories roughly balanced. Preprocessing steps, including anonymization and duplicate removal, were applied to enhance data quality. Three native Bangla speakers independently assessed the tense labels of the sentences, supporting the dataset's reliability. BanglaTense is designed to advance research and development in Bangla NLP, with applications in tense detection, text classification, language modeling, and educational tools. By providing a foundation for temporal analysis of Bangla sentences, the dataset supports linguistic study and the development of precise, context-aware NLP models. It is openly available for academic and research purposes, promoting collaboration and innovation within the Bangla NLP community.
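As a quick check of the balance claim, the reported per-class counts can be turned into class shares. The sketch below uses only the figures stated in the abstract; the function name `class_shares` and the label strings are illustrative choices, not part of the dataset's own format.

```python
# Per-class sentence counts as reported in the abstract.
counts = {"past": 5629, "present": 6101, "future": 6089}


def class_shares(counts):
    """Return the total sentence count and each class's share of it."""
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    return total, shares


total, shares = class_shares(counts)
print(f"total sentences: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Accuracy of a trivial classifier that always predicts the largest class;
# a useful floor when reading benchmark results on a three-way task.
majority_baseline = max(shares.values())
print(f"majority-class baseline: {majority_baseline:.1%}")
```

The three counts sum to the stated 17,819 sentences, and each class holds roughly a third of the data, so a majority-class baseline sits only slightly above chance for a three-way task.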