Hossain Tanzir, Islam Ar-Rafi, Mehedi Md Humaion Kabir, Rasel Annajiat Alim, Abdullah-Al-Wadud M, Uddin Jia
Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh.
Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.
PLoS One. 2025 Jun 9;20(6):e0321291. doi: 10.1371/journal.pone.0321291. eCollection 2025.
As Bangla news floods the web, the need for smarter, more efficient classification techniques is greater than ever. Previous studies focused mostly on traditional models and overlooked the potential of hybrid techniques to handle ever-growing, complex datasets and the linguistic patterns of Bangla with higher accuracy. To address this challenge, this study presents a comprehensive approach to classifying Bangla news articles into eight distinct categories using a range of machine learning and deep learning techniques. Our experiments covered traditional machine learning algorithms, deep learning architectures, and hybrid models, including novel stacking classifiers. We used a dataset of 118,404 Bangla news articles and applied feature extraction techniques including TF-IDF vectorization and Word2Vec embeddings. Our best-performing model, a stacking meta-classifier combining a bidirectional long short-term memory (BiLSTM) network and a support vector machine (SVM), achieved 94% accuracy, outperforming all of the basic models. We also provide an in-depth analysis of model performance, including confusion matrices, ROC curves, and error analysis, offering insight into the strengths and limitations of each approach. This research contributes to the field of Bangla natural language processing and demonstrates the efficacy of ensemble methods and deep learning for news classification in low-resource languages.
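To make the TF-IDF feature extraction step concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state which TF-IDF weighting or tokenizer was used, so the formula here (term frequency normalized by document length, unsmoothed logarithmic inverse document frequency), the function name `tfidf`, and the toy corpus are all assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    This is one common variant; the paper does not say which it used.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus of whitespace-tokenized placeholders,
# standing in for news articles; not the paper's dataset.
docs = [
    "team wins cricket match".split(),
    "bank raises interest rate".split(),
    "team signs new cricket captain".split(),
]
vectors = tfidf(docs)
```

In this sketch a term that appears in only one document (such as "wins") receives a higher weight than a term shared across documents (such as "team"), and a term present in every document receives a weight of zero. Vectors of this kind could then be fed to a classifier such as an SVM.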