Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA 30322.
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37232.
Proc Natl Acad Sci U S A. 2023 Feb 21;120(8):e2207391120. doi: 10.1073/pnas.2207391120. Epub 2023 Feb 14.
Traditional substance use (SU) surveillance methods, such as surveys, incur substantial lags. Due to the continuously evolving trends in SU, insights obtained via such methods are often outdated. Social media-based sources have been proposed for obtaining timely insights, but methods leveraging such data cannot typically provide fine-grained statistics about subpopulations, unlike traditional approaches. We address this gap by developing methods for automatically characterizing a large Twitter nonmedical prescription medication use (NPMU) cohort (n = 288,562) in terms of age-group, race, and gender. Our natural language processing and machine learning methods for automated cohort characterization achieved 0.88 precision (95% CI: 0.84 to 0.92) for age-group, 0.90 precision (95% CI: 0.85 to 0.95) for race, and 94% accuracy (95% CI: 92 to 97) for gender, when evaluated against manually annotated gold-standard data. We compared automatically derived statistics for NPMU of tranquilizers, stimulants, and opioids from Twitter with statistics reported in the National Survey on Drug Use and Health (NSDUH) and the National Emergency Department Sample (NEDS). Distributions automatically estimated from Twitter were mostly consistent with the NSDUH [Spearman correlation: race: 0.98 (P < 0.005); age-group: 0.67 (P < 0.005); gender: 0.66 (P = 0.27)] and NEDS, with 34/65 (52.3%) of the Twitter-based estimates lying within 95% CIs of estimates from the traditional sources. Explainable differences (e.g., overrepresentation of younger people) were found for age-group-related statistics. Our study demonstrates that accurate subpopulation-specific estimates about SU, particularly NPMU, may be automatically derived from Twitter to obtain earlier insights about targeted subpopulations compared to traditional surveillance approaches.
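The two agreement measures reported above (a Spearman rank correlation between Twitter-derived and survey-derived distributions, and the share of Twitter estimates falling inside the survey 95% CIs) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the numbers are invented toy values rather than NSDUH or NEDS figures, and the sketch computes only the correlation coefficient, not its P value.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def share_within_ci(estimates, intervals):
    """Count estimates that fall inside their matching (low, high) interval."""
    hits = sum(low <= e <= high for e, (low, high) in zip(estimates, intervals))
    return hits, hits / len(estimates)


# Hypothetical toy proportions for four subgroups (NOT data from the study).
twitter = [0.62, 0.14, 0.17, 0.07]
survey = [0.66, 0.11, 0.18, 0.05]
survey_ci = [(0.63, 0.69), (0.09, 0.13), (0.15, 0.21), (0.04, 0.08)]

rho = spearman(twitter, survey)
hits, share = share_within_ci(twitter, survey_ci)
print(f"Spearman correlation: {rho:.2f}")
print(f"Within 95% CI: {hits}/{len(twitter)} ({share:.1%})")
```

The toy values show why the two measures are reported side by side: the subgroup ordering can agree perfectly while individual estimates still fall outside the reference intervals, so a high rank correlation does not by itself imply that most estimates lie within the CIs.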