Morgan-Lopez Antonio A, Kim Annice E, Chew Robert F, Ruddle Paul
Behavioral Health and Criminal Justice Research Division, RTI International, Research Triangle Park, North Carolina, United States of America.
Center for Health Policy Science & Tobacco Research, RTI International, Berkeley, California, United States of America.
PLoS One. 2017 Aug 29;12(8):e0183537. doi: 10.1371/journal.pone.0183537. eCollection 2017.
Health organizations are increasingly using social media, such as Twitter, to disseminate health messages to target audiences. Determining the extent to which the target audience (e.g., age groups) was reached is critical to evaluating the impact of social media education campaigns. The main objective of this study was to examine the separate and joint predictive validity of linguistic and metadata features in predicting the age of Twitter users. We created a labeled dataset of Twitter users across different age groups (youth, young adults, adults) by collecting publicly available birthday announcement tweets using the Twitter Search application programming interface. We manually reviewed the results and, for each age-labeled handle, collected the 200 most recent publicly available tweets and the handle's metadata. The labeled data were split into training and test datasets. We created separate models to examine the predictive validity of language features only, metadata features only, language and metadata features combined, and words/phrases from another age-validated dataset. We estimated accuracy, precision, recall, and F1 metrics for each model. An L1-regularized logistic regression model was fit for each age group, and predicted probabilities between the training and test sets were compared for each age group. Cohen's d effect sizes were calculated to examine the relative importance of significant features. Models containing both tweet language features and metadata features performed best (74% precision, 74% recall, 74% F1), while the model containing only Twitter metadata features was the least accurate (58% precision, 60% recall, 57% F1). Top predictive features included use of terms such as "school" for youth and "college" for young adults. Overall, it was more challenging to predict older adults accurately.
These results suggest that examining linguistic and Twitter metadata features to predict youth and young adult Twitter users may be helpful for informing public health surveillance and evaluation research.
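As an illustration of the evaluation quantities named in the abstract, the following minimal sketch computes per-age-group precision, recall, and F1, and a Cohen's d effect size with a pooled standard deviation. It is not the authors' code: the function names, the label strings, and all of the toy data (including the per-user counts of the word "school") are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical label strings for the three age groups in the abstract.
AGE_GROUPS = ("youth", "young_adult", "adult")


def per_class_metrics(y_true, y_pred, labels=AGE_GROUPS):
    """Return {label: (precision, recall, f1)}, treating each label one-vs-rest."""
    metrics = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Toy data (made up): true vs. predicted age group for six handles.
y_true = ["youth", "youth", "young_adult", "young_adult", "adult", "adult"]
y_pred = ["youth", "young_adult", "young_adult", "young_adult", "adult", "youth"]
metrics = per_class_metrics(y_true, y_pred)

# Toy data (made up): how often each user wrote "school" in their tweets.
school_youth = [4.0, 5.0, 6.0]
school_other = [1.0, 2.0, 3.0]
effect = cohens_d(school_youth, school_other)

for label, (p, r, f1) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"Cohen's d for 'school' (youth vs. others): {effect:.2f}")
```

In this toy example the adult group has perfect precision but only 50% recall, which shows why the three metrics are reported separately for each model rather than summarized by accuracy alone.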