Carrella Fabio, Miani Alessandro, Lewandowsky Stephan
School of Psychological Science, University of Bristol.
Institute of Work and Organizational Psychology, University of Neuchâtel.
Proc Conf Assoc Comput Linguist Meet. 2023 May;2023:2339-2349. Epub 2023 May 1.
The dissemination of false information on the internet has received considerable attention over the last decade. Misinformation often spreads faster than mainstream news, which makes manual fact-checking inefficient or, at best, labor-intensive. Therefore, there is an increasing need to develop methods for the automatic detection of misinformation. Although resources for creating such methods are available in English, other languages are often underrepresented in this effort. With this contribution, we present IRMA, a corpus containing over 600,000 Italian news articles (335+ million tokens) collected from 56 websites classified as 'untrustworthy' by professional fact-checkers. The corpus is freely available and comprises a rich set of text- and website-level data, representing a turnkey resource to test hypotheses and develop automatic detection algorithms. It contains texts, titles, and dates (from 2004 to 2022), along with three types of semantic measures (i.e., keywords, topics at three different resolutions, and LIWC lexical features). IRMA also includes domain-specific information such as source type (e.g., political, health, conspiracy), quality, and higher-level metadata, including several metrics of incoming website traffic that allow researchers to investigate users' online behavior. IRMA constitutes the largest corpus of misinformation available today in Italian, making it a valid tool for advancing quantitative research on untrustworthy news detection and ultimately helping limit the spread of misinformation.