Gasparini Francesca, Rizzi Giulia, Saibene Aurora, Fersini Elisabetta
Department of Informatics, Systems and Communication, University of Milano-Bicocca, Italy.
Data Brief. 2022 Aug 20;44:108526. doi: 10.1016/j.dib.2022.108526. eCollection 2022 Oct.
In this paper we present a benchmark dataset generated as part of a project on the automatic identification of misogyny in online content, with a particular focus on memes. The benchmark consists of 800 memes collected from the most popular social media platforms, such as Facebook, Twitter, Instagram and Reddit, and from websites dedicated to the collection and creation of memes. To gather misogynistic memes, keywords referring to misogynistic content were used as search criteria, covering different manifestations of hatred against women, such as body shaming, stereotyping, objectification and violence. In parallel, memes with no misogynistic content were manually downloaded from the same web sources. From all the collected memes, three domain experts selected a dataset of 800 memes, equally balanced between misogynistic and non-misogynistic ones. This dataset was validated through a crowdsourcing platform, involving 60 subjects in the labelling process, in order to collect three evaluations for each instance. For memes evaluated as misogynistic, two further binary labels, concerning aggressiveness and irony, were collected from both the experts and the crowdsourcing platform. Finally, the text of each meme was manually transcribed. The dataset provided thus comprises the 800 memes, the labels given by the experts, the labels obtained through the crowdsourcing validation, and the transcribed texts. These data can be used to address the automatic detection of misogynistic content on the Web, relying on both textual and visual cues, in response to phenomena that are growing every day, such as cybersexism and technology-facilitated violence.
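Since three crowdsourced evaluations are collected for each meme, a user of the dataset may want to reduce them to a single label. The abstract does not state how the evaluations are combined, so the sketch below assumes a simple majority vote. The meme identifiers, the 0/1 encoding and the function name `majority_label` are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label among an odd number of votes.

    Assumption: majority voting is only one possible way to combine
    the three crowd evaluations; the source does not prescribe it.
    """
    if len(votes) % 2 == 0:
        raise ValueError("an odd number of votes is needed to avoid ties")
    return Counter(votes).most_common(1)[0][0]


# Hypothetical crowd evaluations: meme id -> three binary votes
# (1 = misogynistic, 0 = not misogynistic).
crowd_votes = {
    "meme_001": [1, 1, 0],
    "meme_002": [0, 0, 0],
}

# One aggregated label per meme.
aggregated = {meme_id: majority_label(v) for meme_id, v in crowd_votes.items()}
```

The same function could be applied to the aggressiveness and irony labels, which are also binary, for the memes evaluated as misogynistic.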