Sumikawa Yasunobu, Jatowt Adam
Takushoku University, Japan.
University of Innsbruck, Austria.
Data Brief. 2021 Sep 4;38:107344. doi: 10.1016/j.dib.2021.107344. eCollection 2021 Oct.
In this article, we present a dataset of history-related content obtained from social media. It contains hashtags and the tweets that include them, the output of third-party tools applied to those tweets (extracted entities, years, and URL categories), and the categories of the history-related hashtags we used to crawl the tweets. We collected the tweets through the official Twitter API using hashtag-based crawling, which ran from March 2016 to July 2018. During the crawling, we applied a bootstrapping approach: an iterative process of collecting tweets using a small set of seed hashtags, then manually inspecting newly acquired hashtags that co-occur with the seed hashtags and adding those that are related to history. In total, we collected 147 history-related hashtags and 2,370,252 tweets. After manually investigating the collected hashtags, we defined 6 categories for them. The presented dataset could be useful for further analysis of how people refer to history on Twitter, for collecting new history-related tweets, for training classifiers to detect history-related tweets, or for further investigation of the proposed hashtag categories.
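The bootstrapping loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' crawler. It assumes an in-memory list of tweet texts in place of the Twitter API, and a hypothetical callback, `is_history_related`, in place of the manual inspection step. The function names, the hashtag regular expression, and the `max_rounds` limit are likewise assumptions made for illustration.

```python
import re
from collections import Counter

# Simplified hashtag pattern (an assumption; real hashtag rules are richer).
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the set of lowercase hashtags (without '#') in a tweet text."""
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}


def bootstrap_hashtags(tweets, seeds, is_history_related, max_rounds=5):
    """Grow a hashtag set from seeds via co-occurrence and inspection.

    tweets: list of tweet texts (stand-in for an API crawl).
    seeds: initial hashtags, without the leading '#'.
    is_history_related: callable(tag) -> bool, standing in for the
        manual inspection of each newly seen hashtag.
    """
    accepted = {s.lower() for s in seeds}
    rejected = set()
    for _ in range(max_rounds):
        # "Crawl": keep tweets containing at least one accepted hashtag.
        collected = [t for t in tweets if extract_hashtags(t) & accepted]
        # Count unseen hashtags that co-occur with accepted ones.
        candidates = Counter()
        for text in collected:
            for tag in extract_hashtags(text) - accepted - rejected:
                candidates[tag] += 1
        # Inspect candidates, most frequent first.
        new_tags = set()
        for tag, _count in candidates.most_common():
            if is_history_related(tag):
                new_tags.add(tag)
            else:
                rejected.add(tag)
        if not new_tags:
            break  # no further growth, stop iterating
        accepted |= new_tags
    # Final pass so the tweets reflect the full accepted hashtag set.
    collected = [t for t in tweets if extract_hashtags(t) & accepted]
    return accepted, collected
```

Starting from a single seed such as `history`, each round widens the crawl with hashtags that passed inspection, so tweets that never mention the seed can still be reached through intermediate hashtags.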