Otsuka Eriko, Wallace Scott A, Chiu David
School of Engineering and Computer Science, Washington State University, Vancouver, USA.
Department of Mathematics and Computer Science, University of Puget Sound, Tacoma, USA.
Comput Soc Netw. 2016;3(1):3. doi: 10.1186/s40649-016-0028-9. Epub 2016 May 31.
Twitter has evolved into a powerful communication and information-sharing tool used by millions of people around the world to post what is happening now. A hashtag, a keyword prefixed with a hash symbol (#), is a Twitter feature for organizing tweets and facilitating effective search across a massive volume of data. In this paper, we propose an automatic hashtag recommendation system that helps users find new hashtags related to their interests on demand.
For hashtag ranking, we propose the Hashtag Frequency-Inverse Hashtag Ubiquity (HF-IHU) ranking scheme, a variation of the well-known TF-IDF that accounts for hashtag relevancy as well as data sparseness, one of the key challenges in analyzing microblog data. Our system is built on top of Hadoop, a leading platform for distributed computing, to provide scalable performance using Map-Reduce. Experiments on a large Twitter data set demonstrate that our method yields hashtags relevant to users' interests and that its recommendations are more stable and reliable than those obtained by ranking tags on tweet content similarity.
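The abstract does not state the HF-IHU formula, so the following is only a minimal sketch of a TF-IDF-style analogue, not the authors' definition. It assumes that "hashtag frequency" is the share of a hashtag's co-occurring terms taken up by a given term, and that "inverse hashtag ubiquity" is the log of the total number of hashtags divided by the number of hashtags the term co-occurs with. The function names `build_index` and `rank_hashtags` are hypothetical, and the sketch runs on a single machine rather than on Hadoop.

```python
import math
from collections import Counter, defaultdict


def build_index(tweets):
    """Index (terms, hashtags) pairs by hashtag and by term.

    Returns term_counts (hashtag -> Counter of co-occurring terms) and
    tags_per_term (term -> set of hashtags the term co-occurs with).
    """
    term_counts = defaultdict(Counter)
    tags_per_term = defaultdict(set)
    for terms, hashtags in tweets:
        for tag in hashtags:
            term_counts[tag].update(terms)
            for term in terms:
                tags_per_term[term].add(tag)
    return term_counts, tags_per_term


def rank_hashtags(query_terms, term_counts, tags_per_term, top_k=10):
    """Rank hashtags for a set of query terms with a TF-IDF-like score.

    The weighting below is an assumed analogue of TF-IDF, not the
    published HF-IHU formula.
    """
    n_tags = len(term_counts)
    scores = Counter()
    for term in set(query_terms):
        tags = tags_per_term.get(term)
        if not tags:
            continue  # term never seen alongside any hashtag
        # Terms that co-occur with every hashtag get weight log(1) = 0.
        ubiquity_weight = math.log(n_tags / len(tags))
        for tag in tags:
            frequency = term_counts[tag][term] / sum(term_counts[tag].values())
            scores[tag] += frequency * ubiquity_weight
    return [tag for tag, _ in scores.most_common(top_k)]
```

Under these assumptions, a term that appears alongside nearly every hashtag contributes almost nothing to the ranking, mirroring how TF-IDF discounts words that occur in most documents.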
Our results show that HF-IHU can achieve over 30% hashtag recall when asked to identify the top 10 relevant hashtags for a particular tweet. Furthermore, our method outperforms kNN, k-popularity, and Naïve Bayes by 69%, 54%, and 17%, respectively, on recall of the top 200 hashtags.
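As a rough illustration of the recall figures above, the sketch below computes recall at a cutoff k for a single tweet. It assumes recall means the fraction of the tweet's actual hashtags that appear among the top k recommendations; the abstract does not spell out the exact evaluation protocol, and the name `recall_at_k` is hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant hashtags found in the top-k ranked list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen here for tweets with no hashtags
    hits = relevant.intersection(ranked[:k])
    return len(hits) / len(relevant)
```

For example, if a tweet carries two hashtags and only one of them appears in the top 10 recommendations, this function returns 0.5 for that tweet.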