Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter.

Author information

Alshaabi Thayer, Adams Jane L, Arnold Michael V, Minot Joshua R, Dewhurst David R, Reagan Andrew J, Danforth Christopher M, Dodds Peter Sheridan

Affiliations

Vermont Complex Systems Center, University of Vermont, Burlington, VT 05405, USA.

Computational Story Lab, University of Vermont, Burlington, VT 05405, USA.

Publication information

Sci Adv. 2021 Jul 16;7(29). doi: 10.1126/sciadv.abe6534. Print 2021 Jul.

DOI: 10.1126/sciadv.abe6534

PMID: 34272243

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8284897/

Abstract

In real time, Twitter strongly imprints world events, popular culture, and the day-to-day, recording an ever-growing compendium of language change. Vitally, and absent from many standard corpora such as books and news archives, Twitter also encodes popularity and spreading through retweets. Here, we describe Storywrangler, an ongoing curation of over 100 billion tweets containing 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into 1-, 2-, and 3-grams across 100+ languages, generating frequencies for words, hashtags, handles, numerals, symbols, and emojis. We make the dataset available through an interactive time series viewer and as downloadable time series and daily distributions. Although Storywrangler leverages Twitter data, our method of tracking dynamic changes in n-grams can be extended to any temporally evolving corpus. Illustrating the instrument's potential, we present example use cases including social amplification, the sociotechnical dynamics of famous individuals, box office success, and social unrest.
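The daily n-gram counting described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the tokenizer regex, the function names (`tokenize`, `ngrams`, `daily_ngram_counts`, `relative_frequencies`), and the `(date, text)` input format are all assumptions made for illustration, and language detection and retweet handling are omitted.

```python
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer (an assumption, not the paper's parser): keeps
# hashtags and handles whole, and treats any other non-space symbol
# (punctuation, emoji code points) as its own token.
TOKEN_RE = re.compile(r"[#@]?\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into 1-gram tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def daily_ngram_counts(tweets, n):
    """Count n-grams per day.

    `tweets` is an iterable of (date_string, text) pairs; this input
    shape is assumed for the sketch.
    """
    counts = defaultdict(Counter)
    for day, text in tweets:
        counts[day].update(ngrams(tokenize(text), n))
    return counts


def relative_frequencies(counter):
    """Normalize one day's counts so they sum to 1."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counter.items()}
```

Calling `daily_ngram_counts(tweets, 2)` on a list of dated tweets yields one `Counter` of 2-grams per day, and `relative_frequencies` turns a single day's counter into a daily distribution that can be tracked over time.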

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/039f/8284897/4c0b711d7f90/abe6534-F1.jpg
