Curating Cyberbullying Datasets: a Human-AI Collaborative Approach.

Authors

Gomez Christopher E, Sztainberg Marcelo O, Trana Rachel E

Affiliation

Department of Computer Science, Northeastern Illinois University, 5500 N St. Louis Ave, Chicago, IL 60625 USA.

Publication information

Int J Bullying Prev. 2022;4(1):35-46. doi: 10.1007/s42380-021-00114-6. Epub 2021 Dec 22.

DOI:10.1007/s42380-021-00114-6
PMID:34957375
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8691962/
Abstract

Cyberbullying is the use of digital communication tools and spaces to inflict physical, mental, or emotional distress. This serious form of aggression is frequently targeted at, but not limited to, vulnerable populations. A common problem when creating machine learning models to identify cyberbullying is the availability of accurately annotated, reliable, relevant, and diverse datasets. Datasets intended to train models for cyberbullying detection are typically annotated by human participants, which can introduce the following issues: (1) annotator bias, (2) incorrect annotation due to language and cultural barriers, and (3) the inherent subjectivity of the task can naturally create multiple valid labels for a given comment. The result can be a potentially inadequate dataset with one or more of these overlapping issues. We propose two machine learning approaches to identify and filter unambiguous comments in a cyberbullying dataset of roughly 19,000 comments collected from YouTube that was initially annotated using Amazon Mechanical Turk (AMT). Using consensus filtering methods, comments were classified as unambiguous when an agreement occurred between the AMT workers' majority label and the unanimous algorithmic filtering label. Comments identified as unambiguous were extracted and used to curate new datasets. We then used an artificial neural network to test for performance on these datasets. Compared to the original dataset, the classifier exhibits a large improvement in performance on modified versions of the dataset and can yield insight into the type of data that is consistently classified as bullying or non-bullying. This annotation approach can be expanded from cyberbullying datasets onto any classification corpus that has a similar complexity in scope.
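The consensus filtering rule described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the filtering algorithms, the number of AMT workers per comment, or the label encoding, so the binary labels (1 = bullying, 0 = non-bullying), the strict-majority rule, and the helper names `majority_label`, `unanimous_label`, and `filter_unambiguous` are all hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

# Assumed encoding (not stated in the abstract): 1 = bullying, 0 = non-bullying.


def majority_label(labels: Sequence[int]) -> Optional[int]:
    """Return the label chosen by more than half of the annotators, else None."""
    if not labels:
        return None
    top_label, top_count = Counter(labels).most_common(1)[0]
    # Require a strict majority so that ties yield no label.
    if top_count * 2 <= len(labels):
        return None
    return top_label


def unanimous_label(labels: Sequence[int]) -> Optional[int]:
    """Return the shared label if every algorithm agrees, else None."""
    if not labels:
        return None
    first = labels[0]
    return first if all(label == first for label in labels) else None


def filter_unambiguous(
    comments: Sequence[Tuple[str, Sequence[int], Sequence[int]]]
) -> List[Tuple[str, int]]:
    """Keep comments whose worker-majority label equals the unanimous algorithmic label.

    Each input item is (text, worker_labels, algorithm_labels).
    """
    kept = []
    for text, worker_labels, algorithm_labels in comments:
        human = majority_label(worker_labels)
        machine = unanimous_label(algorithm_labels)
        if human is not None and human == machine:
            kept.append((text, human))
    return kept


# Hypothetical toy data; the placeholder strings stand in for real comments.
sample = [
    ("comment A", [1, 1, 0], [1, 1, 1]),  # majority 1, algorithms unanimous 1: kept
    ("comment B", [1, 1, 0], [1, 0, 1]),  # algorithms disagree: dropped
    ("comment C", [0, 0, 0], [1, 1, 1]),  # humans and algorithms disagree: dropped
    ("comment D", [0, 0, 1], [0, 0, 0]),  # majority 0, algorithms unanimous 0: kept
]
curated = filter_unambiguous(sample)
```

Comments that fail either condition are treated as ambiguous here and left out of the curated set; whether the authors discard or separately analyze such comments is not stated in the abstract.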


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/012a/8691962/54a6b68f4681/42380_2021_114_Fig1_HTML.jpg

Similar articles

1
Curating Cyberbullying Datasets: a Human-AI Collaborative Approach.
Int J Bullying Prev. 2022;4(1):35-46. doi: 10.1007/s42380-021-00114-6. Epub 2021 Dec 22.
2
Addressing cyberbullying in Urdu tweets: a comprehensive dataset and detection system.
PeerJ Comput Sci. 2024 Apr 29;10:e1963. doi: 10.7717/peerj-cs.1963. eCollection 2024.
3
COVID-19 and cyberbullying: deep ensemble model to identify cyberbullying from code-switched languages during the pandemic.
Multimed Tools Appl. 2023;82(6):8773-8789. doi: 10.1007/s11042-021-11601-9. Epub 2022 Jan 8.
4
ProTect: a hybrid deep learning model for proactive detection of cyberbullying on social media.
Front Artif Intell. 2024 Mar 6;7:1269366. doi: 10.3389/frai.2024.1269366. eCollection 2024.
5
An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples.
PeerJ Comput Sci. 2021 Sep 16;7:e671. doi: 10.7717/peerj-cs.671. eCollection 2021.
6
The design, construction and evaluation of annotated Arabic cyberbullying corpus.
Educ Inf Technol (Dordr). 2022;27(8):10977-11023. doi: 10.1007/s10639-022-11056-x. Epub 2022 Apr 28.
7
Automatic detection of cyberbullying in social media text.
PLoS One. 2018 Oct 8;13(10):e0203794. doi: 10.1371/journal.pone.0203794. eCollection 2018.
8
ToxLex_bn: A curated dataset of bangla toxic language derived from Facebook comment.
Data Brief. 2022 Jun 24;43:108416. doi: 10.1016/j.dib.2022.108416. eCollection 2022 Aug.
9
Crowdsourcing for Machine Learning in Public Health Surveillance: Lessons Learned From Amazon Mechanical Turk.
J Med Internet Res. 2022 Jan 18;24(1):e28749. doi: 10.2196/28749.
10
The plausibility machine commonsense (PMC) dataset: A massively crowdsourced human-annotated dataset for studying plausibility in large language models.
Data Brief. 2024 Aug 24;57:110869. doi: 10.1016/j.dib.2024.110869. eCollection 2024 Dec.
