Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
J Biomed Inform. 2018 Nov;87:68-78. doi: 10.1016/j.jbi.2018.10.001. Epub 2018 Oct 4.
Although birth defects are the leading cause of infant mortality in the United States, methods for observing human pregnancies with birth defect outcomes are limited.
The primary objectives of this study were (i) to assess whether rare health-related events (in this case, birth defects) are reported on social media, (ii) to design and deploy a natural language processing (NLP) approach for collecting such sparse data from social media, and (iii) to use the collected data to discover a cohort of women whose pregnancies with birth defect outcomes could be observed on social media for epidemiological analysis.
To assess whether birth defects are mentioned on social media, we mined 432 million tweets posted by 112,647 users who were automatically identified via their public announcements of pregnancy on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach that relies on a lexicon, lexical variants generated from the lexicon entries, regular expressions, post-processing, and manual analysis guided by distributional properties. To identify users whose pregnancies with birth defect outcomes could be observed for epidemiological analysis, we applied two inclusion criteria: (i) the user's tweets indicate that the user's child has a birth defect, and (ii) the user's tweets posted during pregnancy are available. We conducted a semi-automatic evaluation to estimate the recall of the tweet-collection approach, and performed a preliminary assessment of the prevalence of selected birth defects among the pregnancy cohort derived from Twitter.
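The lexicon-plus-regular-expression retrieval step can be illustrated with a minimal sketch. This is not the study's implementation: the three-term lexicon, the variant rules (optional spaces or hyphens between words, optional plural "s"), and the names `LEXICON`, `term_pattern`, and `find_mentions` are all assumptions made for illustration.

```python
import re

# Hypothetical mini-lexicon; the study's actual lexicon is not reproduced here.
LEXICON = ["cleft palate", "spina bifida", "congenital heart defect"]


def term_pattern(term):
    """Build a regex that tolerates simple lexical variants of a lexicon term.

    Assumed variant rules (illustrative only): words may be separated by
    whitespace, hyphens, or nothing, and the term may end in a plural "s".
    """
    words = [re.escape(word) for word in term.split()]
    return r"\b" + r"[\s\-]*".join(words) + r"s?\b"


# One named group per lexicon entry, so a match can be mapped back to its term.
MATCHER = re.compile(
    "|".join(f"(?P<t{i}>{term_pattern(t)})" for i, t in enumerate(LEXICON)),
    re.IGNORECASE,
)


def find_mentions(tweet):
    """Return the lexicon entries mentioned in a tweet, in order of appearance."""
    return [LEXICON[int(m.lastgroup[1:])] for m in MATCHER.finditer(tweet)]


if __name__ == "__main__":
    examples = [
        "My son was born with a Cleft-Palate",
        "Raising awareness for #spinabifida today",
        "What a lovely day",
    ]
    for text in examples:
        print(text, "->", find_mentions(text))
```

A matcher like this only retrieves candidate tweets that mention a birth defect. Deciding whether a tweet indicates that the user's child has the defect still requires the manual annotation described in the abstract.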
We manually annotated 16,822 retrieved tweets, distinguishing tweets indicating that the user's child has a birth defect (true positives) from tweets that merely mention birth defects (false positives). Inter-annotator agreement was substantial: κ = 0.79 (Cohen's kappa). Analyzing the timelines of the 646 users whose tweets were true positives resulted in the discovery of 195 users who met the inclusion criteria. Congenital heart defects were the most common type of birth defect reported on Twitter, consistent with findings in the general population. Based on an evaluation of 4,169 tweets retrieved using alternative text mining methods, the recall of the tweet-collection approach was 0.95.
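For readers unfamiliar with the agreement statistic, the sketch below shows how Cohen's kappa is computed from two annotators' labels. The label lists are toy data invented for illustration and are unrelated to the study's annotations. The function name `cohens_kappa` is likewise an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each annotator's label rates.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy labels (1 = child has a birth defect, 0 = mere mention); not study data.
    annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
    annotator_2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]
    print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

On these toy labels the annotators agree on 8 of 10 items (observed agreement 0.80) against a chance agreement of 0.52, giving a kappa of about 0.58.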
Our contributions include (i) evidence that rare health-related events are indeed reported on Twitter, (ii) a generalizable, systematic NLP approach for collecting sparse tweets, (iii) a semi-automatic method to identify undetected tweets (false negatives), and (iv) a collection of publicly available tweets by pregnant users with birth defect outcomes, which could be used for future epidemiological analysis. In future work, the annotated tweets could be used to train machine learning algorithms to automatically identify users reporting birth defect outcomes, enabling the large-scale use of social media mining as a complementary method for such epidemiological research.