

An Automated Toxicity Classification on Social Media Using LSTM and Word Embedding.

Affiliations

Yogananda School of Artificial Intelligence, Computing and Data Science, Shoolini University, Solan, Himachal Pradesh 173229, India.

Electronics and Communication Engineering Department, Maharishi Markandeshwar (Deemed to be University), Mullana, Ambala 133207, India.

Publication Information

Comput Intell Neurosci. 2022 Feb 15;2022:8467349. doi: 10.1155/2022/8467349. eCollection 2022.

Abstract

The automated identification of toxicity in text is a crucial area of text analysis, since social media is replete with unfiltered content ranging from mildly abusive to downright hateful. Researchers have found that training datasets can introduce unintended bias and unfairness, causing toxic words to be classified inaccurately in context. In this paper, several approaches for locating toxicity in text are assessed and presented, aiming to enhance the overall quality of text classification. General unsupervised methods were applied, building on state-of-the-art models and external embeddings, to improve accuracy while mitigating bias and raising the F1-score. The suggested approaches combined a long short-term memory (LSTM) deep learning model with GloVe word embeddings, and with word embeddings generated by Bidirectional Encoder Representations from Transformers (BERT), respectively. These models were trained and tested on a large secondary qualitative dataset containing a large number of comments labeled as toxic or nontoxic. Results showed that an acceptable accuracy of 94% and an F1-score of 0.89 were achieved using LSTM with BERT word embeddings in the binary classification of comments (toxic and nontoxic). The combination of LSTM and BERT performed better than both LSTM alone and LSTM with GloVe word embeddings. This paper addresses the problem of classifying comments with high accuracy by pretraining models on larger corpora of text (high-quality word embeddings) rather than on the training data alone.

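The architecture the abstract describes (an LSTM consuming pretrained word embeddings, with a sigmoid head for the toxic/nontoxic decision) can be sketched as below. This is a minimal illustration, not the paper's implementation: the embedding size, hidden size, and weights here are hypothetical stand-ins (GloVe vectors are typically 50-300 dimensions, BERT's are 768), and the parameters are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

EMB = 8  # embedding size (illustrative; GloVe uses 50-300, BERT uses 768)
HID = 4  # LSTM hidden size (hypothetical)

# Randomly initialised LSTM parameters; in the paper these would be learned.
# The four gates (input, forget, cell candidate, output) are stacked in one matrix.
W = rng.normal(scale=0.1, size=(4 * HID, EMB + HID))
b = np.zeros(4 * HID)

def lstm_forward(embeddings):
    """Run one LSTM pass over a (seq_len, EMB) array of word embeddings."""
    h = np.zeros(HID)
    c = np.zeros(HID)
    for x in embeddings:
        z = W @ np.concatenate([x, h]) + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)  # cell state update
        h = o * np.tanh(c)          # hidden state
    return h

# Sigmoid classification head on the final hidden state.
w_out = rng.normal(scale=0.1, size=HID)

def classify(embeddings):
    """Return P(toxic) for one comment given its per-token embeddings."""
    return sigmoid(w_out @ lstm_forward(embeddings))

# A 5-token comment represented by stand-in embedding vectors; in the paper
# these would come from a GloVe lookup table or a BERT encoder.
comment = rng.normal(size=(5, EMB))
print(f"P(toxic) = {classify(comment):.3f}")
```

The paper's finding is that swapping the embedding source from GloVe (static, per-word vectors) to BERT (contextual vectors that differ by sentence) improved the downstream classifier, while the LSTM-plus-sigmoid head stayed the same.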

Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a57f/8863472/7aa340977e0d/CIN2022-8467349.001.jpg
