Alshaabi Thayer, Adams Jane L, Arnold Michael V, Minot Joshua R, Dewhurst David R, Reagan Andrew J, Danforth Christopher M, Dodds Peter Sheridan
Vermont Complex Systems Center, University of Vermont, Burlington, VT 05405, USA.
Computational Story Lab, University of Vermont, Burlington, VT 05405, USA.
Sci Adv. 2021 Jul 16;7(29). doi: 10.1126/sciadv.abe6534. Print 2021 Jul.
In real time, Twitter strongly imprints world events, popular culture, and the day-to-day, recording an ever-growing compendium of language change. Vitally, and absent from many standard corpora such as books and news archives, Twitter also encodes popularity and spreading through retweets. Here, we describe Storywrangler, an ongoing curation of over 100 billion tweets containing 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into 1-, 2-, and 3-grams across 100+ languages, generating frequencies for words, hashtags, handles, numerals, symbols, and emojis. We make the dataset available through an interactive time series viewer and as downloadable time series and daily distributions. Although Storywrangler leverages Twitter data, our method of tracking dynamic changes in n-grams can be extended to any temporally evolving corpus. Illustrating the instrument's potential, we present example use cases including social amplification, the sociotechnical dynamics of famous individuals, box office success, and social unrest.
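The daily n-gram counting described in the abstract can be illustrated with a minimal sketch. This is not the Storywrangler pipeline: the regular-expression tokenizer, the function names (`tokenize`, `ngrams`, `daily_ngram_counts`, `relative_frequencies`), and the `(day, text)` input format are assumptions made for illustration only. Language detection and retweet handling are omitted.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, simplified tokenizer (an assumption, not the authors' parser):
# keeps hashtags and handles as single tokens and splits any other
# non-word, non-space character (punctuation, symbols, single-code-point
# emojis) into its own token.
TOKEN_RE = re.compile(r"[#@]?\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into 1-grams."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def daily_ngram_counts(tweets, max_n=3):
    """Count 1- to max_n-grams per day.

    `tweets` is an iterable of (day, text) pairs; the result maps
    (day, n) to a Counter of n-gram counts for that day.
    """
    counts = defaultdict(Counter)
    for day, text in tweets:
        tokens = tokenize(text)
        for n in range(1, max_n + 1):
            counts[(day, n)].update(ngrams(tokens, n))
    return counts


def relative_frequencies(counter):
    """Convert raw counts for one day into relative frequencies."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counter.items()}


# Toy input, invented purely for illustration.
sample = [
    ("2021-01-01", "Hello world #news"),
    ("2021-01-01", "hello again"),
]
counts = daily_ngram_counts(sample)
freqs = relative_frequencies(counts[("2021-01-01", 1)])
```

Because the counts are keyed by day and n-gram order, the same structure applies to any timestamped corpus, in line with the abstract's remark that the method extends beyond Twitter.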