A hashtag recommendation system for Twitter data streams.

Author information

Otsuka Eriko, Wallace Scott A, Chiu David

Affiliations

School of Engineering and Computer Science, Washington State University, Vancouver, USA.

Department of Mathematics and Computer Science, University of Puget Sound, Tacoma, USA.

Publication information

Comput Soc Netw. 2016;3(1):3. doi: 10.1186/s40649-016-0028-9. Epub 2016 May 31.

DOI:10.1186/s40649-016-0028-9

PMID:29355223

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5749337/

Abstract

BACKGROUND

Twitter has evolved into a powerful communication and information-sharing tool used by millions of people around the world to post what is happening now. A hashtag, a keyword prefixed with a hash symbol (#), is a Twitter feature for organizing tweets and facilitating effective search among a massive volume of data. In this paper, we propose an automatic hashtag recommendation system that helps users find new hashtags related to their interests on demand.

METHODS

For hashtag ranking, we propose the Hashtag Frequency-Inverse Hashtag Ubiquity (HF-IHU) ranking scheme, a variation of the well-known TF-IDF that accounts for hashtag relevancy as well as data sparseness, one of the key challenges in analyzing microblog data. Our system is built on top of Hadoop, a leading platform for distributed computing, to provide scalable performance using MapReduce. Experiments on a large Twitter data set demonstrate that our method successfully yields hashtags relevant to a user's interests and that its recommendations are more stable and reliable than those obtained by ranking tags on tweet content similarity.
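The abstract does not state the HF-IHU formula, so the following is only a rough sketch of the general idea of a TF-IDF-style hashtag ranking, not the authors' scheme. It assumes each hashtag is treated as a "document" made of the terms from tweets carrying that tag, and it scores hashtags for a query with plain TF-IDF. The function names `build_profiles` and `recommend` and the input layout are hypothetical, and the sketch runs in memory rather than on Hadoop.

```python
import math
from collections import Counter, defaultdict


def build_profiles(tweets):
    """Collect, for each hashtag, counts of the terms it co-occurs with.

    `tweets` is an iterable of (terms, hashtags) pairs; this layout is an
    assumption made for illustration.
    """
    profiles = defaultdict(Counter)
    for terms, hashtags in tweets:
        for tag in hashtags:
            profiles[tag].update(terms)
    return profiles


def recommend(profiles, query_terms, k=10):
    """Rank hashtags for the query terms with plain TF-IDF.

    This is standard TF-IDF with hashtags as documents, NOT the HF-IHU
    weighting from the paper, whose exact definition is not given here.
    """
    n_tags = len(profiles)

    # Number of hashtag profiles in which each term appears at least once.
    tags_with_term = Counter()
    for counts in profiles.values():
        tags_with_term.update(counts.keys())

    scores = {}
    for tag, counts in profiles.items():
        total = sum(counts.values())
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / total
                idf = math.log(n_tags / tags_with_term[term])
                score += tf * idf
        if score > 0:
            scores[tag] = score

    # Highest-scoring hashtags first, truncated to the top k.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A term that appears under every hashtag gets an inverse weight of zero, so ubiquitous words contribute nothing to the ranking, which is the intuition a TF-IDF variant relies on.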

RESULTS AND CONCLUSIONS

Our results show that HF-IHU can achieve over 30% hashtag recall when asked to identify the top 10 relevant hashtags for a particular tweet. Furthermore, our method outperforms kNN, k-popularity, and Naïve Bayes by 69%, 54%, and 17%, respectively, on recall of the top 200 hashtags.
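For readers unfamiliar with the metric, recall of the top k hashtags can be sketched as below. This is the standard definition of recall at k; the paper's exact evaluation protocol is not described in the abstract, so the function names `recall_at_k` and `mean_recall_at_k` and the choice to score a tweet with no hashtags as 0.0 are assumptions for illustration.

```python
def recall_at_k(ranked, actual, k):
    """Fraction of a tweet's actual hashtags found among the top-k ranked ones."""
    actual = set(actual)
    if not actual:
        # Assumption: a tweet with no hashtags contributes 0.0.
        return 0.0
    hits = len(actual & set(ranked[:k]))
    return hits / len(actual)


def mean_recall_at_k(cases, k):
    """Average recall at k over (ranked, actual) pairs, one pair per tweet."""
    cases = list(cases)
    if not cases:
        return 0.0
    return sum(recall_at_k(ranked, actual, k) for ranked, actual in cases) / len(cases)
```

Under this definition, a recall of 30% at k = 10 means that, on average, about three in ten of a tweet's true hashtags appear among the ten recommendations.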
