Yang Yuan-Chi, Al-Garadi Mohammed Ali, Love Jennifer S, Perrone Jeanmarie, Sarker Abeed
Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, Georgia, USA.
Department of Emergency Medicine, School of Medicine, Oregon Health & Science University, Portland, Oregon, USA.
JAMIA Open. 2021 Jun 23;4(2):ooab042. doi: 10.1093/jamiaopen/ooab042. eCollection 2021 Apr.
Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media users' demographic information (eg, gender) is often not explicitly available from their profiles. Here, we present an automatic gender classification system for social media, and we illustrate how gender information can be incorporated into a social media-based health-related study.
We used a large Twitter dataset composed of public, gender-labeled users (Dataset-1) for training and evaluating the gender detection pipeline. We experimented with machine learning algorithms including support vector machines (SVMs) and deep-learning models, and public packages including M3. We considered users' information including profile and tweets for classification. We also developed a meta-classifier ensemble that strategically uses the predicted scores from the classifiers. We then applied the best-performing pipeline to Twitter users who have self-reported nonmedical use of prescription medications (Dataset-2) to assess the system's utility.
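The meta-classifier idea described above (combining the predicted scores of several base classifiers) can be illustrated with a minimal stacking sketch. This is not the authors' implementation: the function names, the synthetic scores standing in for SVM and M3 outputs, and the choice of logistic regression as the combiner are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def meta_probability(weights, bias, scores):
    """Combine base-classifier scores (each in [0, 1]) into one probability.

    Scores are centered at 0.5 so that a score of 0.5 carries no evidence.
    """
    z = bias + sum(w * (s - 0.5) for w, s in zip(weights, scores))
    return sigmoid(z)


def train_meta_classifier(score_rows, labels, epochs=300, lr=0.5):
    """Fit a logistic-regression combiner by batch gradient descent."""
    n = len(score_rows)
    weights = [0.0] * len(score_rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for scores, y in zip(score_rows, labels):
            err = meta_probability(weights, bias, scores) - y
            for j, s in enumerate(scores):
                grad_w[j] += err * (s - 0.5)
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, scores):
    """Return the predicted binary label (0 or 1) for one user."""
    return 1 if meta_probability(weights, bias, scores) >= 0.5 else 0


# Synthetic stand-ins for two base classifiers' scores (hypothetical data).
random.seed(0)
labels = [i % 2 for i in range(200)]


def noisy_score(y):
    """Simulate a base-classifier score that is informative but noisy."""
    raw = 0.5 + (0.25 if y == 1 else -0.25) + random.gauss(0, 0.2)
    return min(1.0, max(0.0, raw))


score_rows = [[noisy_score(y), noisy_score(y)] for y in labels]
weights, bias = train_meta_classifier(score_rows, labels)
predictions = [predict(weights, bias, row) for row in score_rows]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"meta-classifier training accuracy: {accuracy:.3f}")
```

In practice the combiner would be fitted on held-out scores rather than on the data used to train the base classifiers, so that it does not simply learn their overfitting.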
We collected 67 181 and 176 683 users for Dataset-1 and Dataset-2, respectively. A meta-classifier involving SVM and M3 performed the best (Dataset-1 accuracy: 94.4% [95% confidence interval: 94.0-94.8%]; Dataset-2: 94.4% [95% confidence interval: 92.0-96.6%]). Including automatically classified gender information in the analyses of Dataset-2 revealed gender-specific trends: the proportions of females closely resemble data from the National Survey on Drug Use and Health 2018 (tranquilizers: 0.50 vs 0.50; stimulants: 0.50 vs 0.45) and data on emergency room visits due to opioid overdose from the Nationwide Emergency Department Sample (pain relievers: 0.38 vs 0.37).
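As a rough illustration of how a 95% confidence interval for an accuracy figure can be obtained, the sketch below computes a Wilson score interval. The abstract does not state which interval method the authors used, so the method is an assumption, and the counts in the example (944 correct out of 1000) are made up rather than taken from the study.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical evaluation: 944 correct predictions out of 1000 users.
low, high = wilson_interval(944, 1000)
print(f"accuracy 0.944, 95% CI: [{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of evaluated users grows, which is consistent with the tighter interval reported for Dataset-1 than for Dataset-2.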
Our publicly available, automated gender detection pipeline may aid cohort-specific social media data analyses (https://bitbucket.org/sarkerlab/gender-detection-for-public).