Sazzed Salim
Department of Computer Science, Old Dominion University, Norfolk, VA, USA.
PeerJ Comput Sci. 2021 Nov 16;7:e681. doi: 10.7717/peerj-cs.681. eCollection 2021.
Bengali is a low-resource language that lacks tools and resources for various natural language processing (NLP) tasks, such as sentiment analysis or profanity identification. In Bengali, only translated versions of English sentiment lexicons are available. Moreover, no dictionary exists for detecting profanity in Bengali social media text. This study introduces a Bengali sentiment lexicon, BengSentiLex, and a Bengali swear lexicon, BengSwearLex. To create BengSentiLex, a cross-lingual methodology is proposed that utilizes a machine translation system, a review corpus, two English sentiment lexicons, pointwise mutual information (PMI), and supervised machine learning (ML) classifiers in various stages. A semi-automatic methodology is presented to develop BengSwearLex that leverages an obscene corpus, word embedding, and part-of-speech (POS) taggers. The performance of BengSentiLex is compared with that of the translated English lexicons on three evaluation datasets. BengSentiLex achieves a 5%-50% improvement over the translated lexicons. For identifying profanity, BengSwearLex achieves document-level coverage of around 85% in the evaluation dataset. The experimental results imply that BengSentiLex and BengSwearLex are effective resources for classifying sentiment and identifying profanity in Bengali social media content, respectively.
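As an illustration of how PMI can be used to score candidate sentiment words from a labeled review corpus, here is a minimal Python sketch. It is not the procedure of the study itself: the function name `pmi_polarity`, the add-one smoothing, the document-level counting, and the toy English tokens are all assumptions made for this example. The score is the standard difference PMI(w, pos) - PMI(w, neg), which simplifies to log2(P(w | pos) / P(w | neg)) because the P(w) term cancels.

```python
import math
from collections import Counter


def pmi_polarity(docs, smoothing=1.0):
    """Score each word by PMI(w, pos) - PMI(w, neg).

    docs: iterable of (tokens, label) pairs, label in {"pos", "neg"}.
    Hypothetical helper for illustration; counts are per document.
    """
    word_class = Counter()  # number of documents of a class containing the word
    class_docs = Counter()  # number of documents per class
    for tokens, label in docs:
        class_docs[label] += 1
        for word in set(tokens):  # count each word once per document
            word_class[(word, label)] += 1

    vocab = {word for word, _ in word_class}
    scores = {}
    for word in vocab:
        # Smoothed estimates of P(w | pos) and P(w | neg) (assumed add-one style).
        p_w_pos = (word_class[(word, "pos")] + smoothing) / (
            class_docs["pos"] + 2 * smoothing
        )
        p_w_neg = (word_class[(word, "neg")] + smoothing) / (
            class_docs["neg"] + 2 * smoothing
        )
        # PMI(w, pos) - PMI(w, neg) = log2(P(w | pos) / P(w | neg))
        scores[word] = math.log2(p_w_pos / p_w_neg)
    return scores


# Toy corpus with placeholder English tokens (not data from the study).
toy_docs = [
    (["good", "movie"], "pos"),
    (["good", "acting", "movie"], "pos"),
    (["bad", "movie"], "neg"),
    (["bad", "plot", "movie"], "neg"),
]
scores = pmi_polarity(toy_docs)
```

Words with a clearly positive or negative score would be candidates for the positive or negative side of a lexicon, while words near zero (such as `movie` in the toy corpus) carry little polarity signal. Any threshold for accepting a candidate would need to be tuned on real data.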