
BengSentiLex and BengSwearLex: creating lexicons for sentiment analysis and profanity detection in low-resource Bengali language.

Author information

Sazzed Salim

Affiliation

Department of Computer Science, Old Dominion University, Norfolk, VA, USA.

Publication information

PeerJ Comput Sci. 2021 Nov 16;7:e681. doi: 10.7717/peerj-cs.681. eCollection 2021.

DOI:10.7717/peerj-cs.681
PMID:34901419
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8627231/
Abstract

Bengali is a low-resource language that lacks tools and resources for various natural language processing (NLP) tasks, such as sentiment analysis or profanity identification. In Bengali, only translated versions of English sentiment lexicons are available. Moreover, no dictionary exists for detecting profanity in Bengali social media text. This study introduces a Bengali sentiment lexicon, BengSentiLex, and a Bengali swear lexicon, BengSwearLex. To create BengSentiLex, a cross-lingual methodology is proposed that utilizes a machine translation system, a review corpus, two English sentiment lexicons, pointwise mutual information (PMI), and supervised machine learning (ML) classifiers in various stages. A semi-automatic methodology is presented to develop BengSwearLex that leverages an obscene corpus, word embedding, and part-of-speech (POS) taggers. The performance of BengSentiLex is compared with that of the translated English lexicons on three evaluation datasets. BengSentiLex achieves a 5%-50% improvement over the translated lexicons. For identifying profanity, BengSwearLex achieves document-level coverage of around 85% on the evaluation dataset. The experimental results imply that BengSentiLex and BengSwearLex are effective resources for classifying sentiment and identifying profanity in Bengali social media content, respectively.
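
The abstract names two quantities without defining them: a PMI-based word score and document-level coverage. The Python sketch below shows one common way to compute each, and is not the paper's implementation. The function names `pmi_polarity` and `document_coverage`, the add-one smoothing, the use of document frequencies, and the toy English tokens are all assumptions made for illustration. The polarity score uses the identity PMI(w, pos) - PMI(w, neg) = log P(w | pos) - log P(w | neg).

```python
import math
from collections import Counter


def pmi_polarity(docs, labels, smoothing=1.0):
    """Score each word as PMI(word, pos) - PMI(word, neg).

    docs   : list of token lists (hypothetical, already tokenized)
    labels : list of "pos" / "neg" strings, one per document
    The difference of the two PMI values reduces to
    log P(word | pos) - log P(word | neg), estimated here from
    smoothed document frequencies (an assumption of this sketch).
    """
    doc_freq = {"pos": Counter(), "neg": Counter()}
    n_docs = Counter()
    for tokens, label in zip(docs, labels):
        n_docs[label] += 1
        for word in set(tokens):  # count each word once per document
            doc_freq[label][word] += 1

    vocab = set(doc_freq["pos"]) | set(doc_freq["neg"])
    scores = {}
    for word in vocab:
        p_pos = (doc_freq["pos"][word] + smoothing) / (n_docs["pos"] + 2 * smoothing)
        p_neg = (doc_freq["neg"][word] + smoothing) / (n_docs["neg"] + 2 * smoothing)
        scores[word] = math.log(p_pos) - math.log(p_neg)
    return scores


def document_coverage(docs, lexicon):
    """Fraction of documents containing at least one lexicon entry."""
    if not docs:
        return 0.0
    hits = sum(1 for tokens in docs if any(t in lexicon for t in tokens))
    return hits / len(docs)


# Toy data (hypothetical, English tokens for readability).
reviews = [["good", "phone"], ["good", "service"],
           ["bad", "phone"], ["bad", "service"]]
review_labels = ["pos", "pos", "neg", "neg"]
polarity = pmi_polarity(reviews, review_labels)

comments = [["a", "b"], ["c"], ["d"]]
coverage = document_coverage(comments, {"a", "d"})
```

In this toy data, "good" receives a positive score, "bad" a negative one, and "phone" a score of zero because it appears equally often in both classes. A lexicon builder could keep only words whose absolute score exceeds a chosen threshold, which is again an assumption rather than a detail stated in the abstract.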




