Automatic gender detection in Twitter profiles for health-related cohort studies.

Author information

Yang Yuan-Chi, Al-Garadi Mohammed Ali, Love Jennifer S, Perrone Jeanmarie, Sarker Abeed

Affiliations

Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, Georgia, USA.

Department of Emergency Medicine, School of Medicine, Oregon Health & Science University, Portland, Oregon, USA.

Publication information

JAMIA Open. 2021 Jun 23;4(2):ooab042. doi: 10.1093/jamiaopen/ooab042. eCollection 2021 Apr.

DOI:10.1093/jamiaopen/ooab042

PMID:34169232

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8220305/

Abstract

OBJECTIVE

Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media users' demographic information (eg, gender) is often not explicitly stated in their profiles. Here, we present an automatic gender classification system for social media and illustrate how gender information can be incorporated into a social media-based health-related study.

MATERIALS AND METHODS

We used a large Twitter dataset composed of public, gender-labeled users (Dataset-1) for training and evaluating the gender detection pipeline. We experimented with machine learning algorithms, including support vector machines (SVMs) and deep-learning models, and with public packages, including M3. We considered user information, including profiles and tweets, for classification. We also developed a meta-classifier ensemble that strategically uses the predicted scores from the classifiers. We then applied the best-performing pipeline to Twitter users who have self-reported nonmedical use of prescription medications (Dataset-2) to assess the system's utility.
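
The abstract does not specify how the meta-classifier combines the base classifiers' predicted scores. As a minimal illustrative sketch only, and not the authors' implementation, the Python below blends two hypothetical score lists (named `svm_val` and `m3_val` here purely for illustration) with a single weight tuned by grid search on a labeled validation set. The blending rule, the 0.5 threshold, the 0/1 label encoding, and all data values are assumptions.

```python
from typing import Sequence, Tuple


def combine_scores(score_a: float, score_b: float, weight: float) -> float:
    """Blend two base-classifier scores, each a probability in [0, 1]."""
    return weight * score_a + (1.0 - weight) * score_b


def accuracy(scores: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of users whose thresholded score matches the 0/1 label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def fit_weight(scores_a: Sequence[float], scores_b: Sequence[float],
               labels: Sequence[int], steps: int = 20) -> Tuple[float, float]:
    """Grid-search the blending weight that maximizes validation accuracy."""
    best_weight, best_acc = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        blended = [combine_scores(a, b, w) for a, b in zip(scores_a, scores_b)]
        acc = accuracy(blended, labels)
        if acc > best_acc:  # keep the first weight that reaches the best accuracy
            best_weight, best_acc = w, acc
    return best_weight, best_acc


# Hypothetical validation data: each base classifier errs on a different user.
svm_val = [0.9, 0.2, 0.45, 0.4, 0.8, 0.3]
m3_val = [0.7, 0.4, 0.8, 0.6, 0.9, 0.1]
labels_val = [1, 0, 1, 0, 1, 0]

weight, val_acc = fit_weight(svm_val, m3_val, labels_val)
print(f"weight={weight:.2f}, validation accuracy={val_acc:.3f}")
```

In this toy data each base classifier misclassifies one user, while the blended score classifies all six correctly. A learned combiner such as logistic regression over the scores would be a natural alternative to the fixed grid.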

RESULTS AND DISCUSSION

We collected 67 181 and 176 683 users for Dataset-1 and Dataset-2, respectively. A meta-classifier involving SVM and M3 performed the best (Dataset-1 accuracy: 94.4% [95% confidence interval: 94.0-94.8%]; Dataset-2: 94.4% [95% confidence interval: 92.0-96.6%]). Including the automatically classified gender information in the analyses of Dataset-2 revealed gender-specific trends: the proportions of females closely resemble data from the National Survey on Drug Use and Health 2018 (tranquilizers: 0.50 vs 0.50; stimulants: 0.50 vs 0.45) and data on emergency room visits for opioid overdose from the Nationwide Emergency Department Sample (pain relievers: 0.38 vs 0.37).
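
The abstract reports accuracies with 95% confidence intervals but does not say how the intervals were computed. One common approach is the percentile bootstrap, sketched below in Python as an assumption rather than a description of the authors' procedure. The 500-user evaluation set, its 472 correct predictions, the resample count, and the seed are all hypothetical values chosen for illustration.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(correct: Sequence[int], n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct` holds 1 for each correctly classified user and 0 otherwise.
    """
    rng = random.Random(seed)
    n = len(correct)
    point = sum(correct) / n
    stats = []
    for _ in range(n_resamples):
        # Resample users with replacement and recompute accuracy.
        resample = rng.choices(correct, k=n)
        stats.append(sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper


# Hypothetical evaluation set: 472 of 500 users classified correctly.
outcomes = [1] * 472 + [0] * 28
point, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

The width of such an interval shrinks roughly with the square root of the evaluation-set size, which is consistent with the much narrower interval reported for Dataset-1 than for Dataset-2.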

CONCLUSION

Our publicly available, automated gender detection pipeline may aid cohort-specific social media data analyses (https://bitbucket.org/sarkerlab/gender-detection-for-public).
