Predicting age groups of Twitter users based on language and metadata features.

Author information

Morgan-Lopez Antonio A, Kim Annice E, Chew Robert F, Ruddle Paul

Affiliations

Behavioral Health and Criminal Justice Research Division, RTI International, Research Triangle Park, North Carolina, United States of America.

Center for Health Policy Science & Tobacco Research, RTI International, Berkeley, California, United States of America.

Publication information

PLoS One. 2017 Aug 29;12(8):e0183537. doi: 10.1371/journal.pone.0183537. eCollection 2017.

DOI:10.1371/journal.pone.0183537

PMID:28850620

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5574558/

Abstract

Health organizations are increasingly using social media, such as Twitter, to disseminate health messages to target audiences. Determining the extent to which the target audience (e.g., age groups) was reached is critical to evaluating the impact of social media education campaigns. The main objective of this study was to examine the separate and joint predictive validity of linguistic and metadata features in predicting the age of Twitter users. We created a labeled dataset of Twitter users across different age groups (youth, young adults, adults) by collecting publicly available birthday announcement tweets using the Twitter Search application programming interface. We manually reviewed results and, for each age-labeled handle, collected the 200 most recent publicly available tweets and the handle's metadata. The labeled data were split into training and test datasets. We created separate models to examine the predictive validity of language features only, metadata features only, language and metadata features, and words/phrases from another age-validated dataset. We estimated accuracy, precision, recall, and F1 metrics for each model. An L1-regularized logistic regression model was fit for each age group, and predicted probabilities from the training and test sets were compared for each age group. Cohen's d effect sizes were calculated to examine the relative importance of significant features. Models containing both tweet language features and metadata features performed best (74% precision, 74% recall, 74% F1), while the model containing only Twitter metadata features was the least accurate (58% precision, 60% recall, 57% F1). Top predictive features included use of terms such as "school" for youth and "college" for young adults. Overall, it was more challenging to predict older adults accurately. These results suggest that examining linguistic and Twitter metadata features to predict youth and young adult Twitter users may be helpful for informing public health surveillance and evaluation research.
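
As an illustration of the effect-size step mentioned in the abstract, the following is a minimal sketch of Cohen's d for a single feature, comparing users in one age group against everyone else. The abstract does not say which variant of Cohen's d was used or how the comparison groups were formed, so the pooled-standard-deviation form and the `cohens_d` helper are assumptions, and the sample values are hypothetical rather than data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical counts of the word "school" in each user's 200 most recent tweets.
youth = [4.0, 6.0, 5.0, 7.0, 3.0]
others = [1.0, 2.0, 0.0, 3.0, 1.0]

d = cohens_d(youth, others)
print(f"Cohen's d for 'school' (youth vs. others): {d:.2f}")
```

A larger absolute value of d means the feature separates the two groups more strongly, which is one way to rank features retained by a sparse model by relative importance.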

