RuSentiTweet: a sentiment analysis dataset of general domain tweets in Russian.

Author information

Smetanin Sergey

Affiliation

Department of Business Informatics, Graduate School of Business, National Research University Higher School of Economics, Russia.

Publication information

PeerJ Comput Sci. 2022 Jul 19;8:e1039. doi: 10.7717/peerj-cs.1039. eCollection 2022.

Abstract

The Russian language is still not as well-resourced as English, especially in the field of sentiment analysis of Twitter content. Though several sentiment analysis datasets of tweets in Russian exist, they are all either automatically annotated or manually annotated by a single annotator. Thus, there is no inter-annotator agreement, or the annotation may be focused on a specific domain. In this article, we present RuSentiTweet, a new sentiment analysis dataset of general domain tweets in Russian. RuSentiTweet is currently the largest in its class for Russian, with 13,392 tweets manually annotated with moderate inter-rater agreement into five classes: Positive, Neutral, Negative, Speech Act, and Skip. As a source of data, we used Twitter Stream Grab, a historical collection of tweets obtained from the general Twitter API stream, which provides a 1% sample of the public tweets. Additionally, we released a RuBERT-based sentiment classification model that achieved F1 = 0.6594 on the test subset.
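For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1 score over the five classes could be computed. The abstract does not state which averaging scheme was used, so the macro average here is an assumption, and the toy labels and function names are purely illustrative, not taken from the dataset or the released model.

```python
# Class names come from the abstract; everything else is illustrative.
LABELS = ["Positive", "Neutral", "Negative", "Speech Act", "Skip"]


def per_class_f1(gold, pred, label):
    """F1 for one class, treating it as the positive class (one-vs-rest)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    return sum(per_class_f1(gold, pred, label) for label in labels) / len(labels)


# Made-up annotations for six hypothetical tweets.
gold = ["Positive", "Neutral", "Negative", "Speech Act", "Skip", "Neutral"]
pred = ["Positive", "Neutral", "Neutral", "Speech Act", "Skip", "Neutral"]
score = macro_f1(gold, pred)
```

With macro averaging, every class contributes equally regardless of its frequency, so a class the model never predicts correctly (Negative in the toy data) pulls the score down as much as a frequent class would.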

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9c03/9454938/268e79f0af64/peerj-cs-08-1039-g001.jpg
