A longitudinal study of topic classification on Twitter.

Authors

Bouadjenek Mohamed Reda, Sanner Scott, Iman Zahra, Xie Lexing, Shi Daniel Xiaoliang

Affiliations

School of Information Technology, Deakin University, Geelong, Victoria, Australia.

Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, Ontario, Canada.

Publication information

PeerJ Comput Sci. 2022 Jun 7;8:e991. doi: 10.7717/peerj-cs.991. eCollection 2022.

Abstract

Twitter represents a massively distributed information source over topics ranging from social and political events to entertainment and sports news. While recent work has suggested this content can be narrowed down to the personalized interests of individual users by training topic filters using standard classifiers, there remain many open questions about the efficacy of such classification-based filtering approaches. For example, over a year or more after training, how well do such classifiers generalize to future novel topical content, and are such results stable across a range of topics? In addition, how robust is a topic classifier over the time horizon, i.e., can a model trained in one year be used for making predictions in the subsequent year? Furthermore, what features, feature classes, and feature attributes are most critical for long-term classifier performance? To answer these questions, we collected a corpus of over 800 million English Tweets via the Twitter streaming API during 2013 and 2014 and learned topic classifiers for 10 diverse themes ranging from social issues to celebrity deaths to the "Iran nuclear deal". The results of this long-term study of topic classifier performance provide a number of important insights, among them that: (i) such classifiers can indeed generalize to novel topical content with high precision over a year or more after training, though performance degrades with time, (ii) the classes of hashtags and simple terms contain the most informative feature instances, (iii) removing tweets containing training hashtags from the validation set allows better generalization, and (iv) the simple volume of tweets by a user correlates more with their informativeness than their follower or friend count.
In summary, this work provides a long-term study of topic classifiers on Twitter that further justifies classification-based topical filtering approaches while providing detailed insight into the feature properties most critical for topic classifier performance.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/837c/9202616/313531ef18c9/peerj-cs-08-991-g001.jpg
