Klein Ari Z, Gonzalez-Hernandez Graciela
University of Pennsylvania, Philadelphia, PA, USA.
Data Brief. 2020 Aug 31;32:106249. doi: 10.1016/j.dib.2020.106249. eCollection 2020 Oct.
Despite the prevalence in the United States of miscarriage [1], stillbirth [2], and infant mortality associated with preterm birth and low birthweight [3], their causes remain largely unknown [4], [5], [6]. To advance the use of social media data as a complementary resource for epidemiology of adverse pregnancy outcomes, we present a data set of 6487 tweets that mention miscarriage, stillbirth, preterm birth or premature labor, low birthweight, neonatal intensive care, or fetal/infant loss in general. These tweets are a subset of 22,912 tweets retrieved by applying hand-written regular expressions to a database containing more than 400 million public tweets posted by more than 100,000 women who have announced their pregnancy on Twitter [7]. Two professional annotators labeled the 6487 tweets in a binary fashion, distinguishing those potentially reporting that the user has personally experienced the outcome ("outcome" tweets) from those that merely mention the outcome ("non-outcome" tweets). Inter-annotator agreement was κ = 0.90 (Cohen's kappa). The tweets annotated as "outcome" include 1318 women reporting miscarriage, 94 stillbirth, 591 preterm birth or premature labor, 171 low birthweight, 453 neonatal intensive care, and 356 fetal/infant loss in general. These "outcome" tweets can be used to explore patient experiences and perceptions of adverse pregnancy outcomes, and can direct researchers to the users' broader timelines (the tweets posted by a user over time) for observational studies. Our past work demonstrates the analysis of timelines for selecting a study population [8] and conducting a case-control study [9] of users reporting that their child has a birth defect. For larger-scale studies, the full annotated corpus can be used to train supervised machine learning algorithms to automatically identify additional users reporting adverse pregnancy outcomes on Twitter.
We used the annotated corpus to train feature-engineered and deep learning-based classifiers presented in "A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes" [10].
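For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Cohen's kappa can be computed for two annotators' binary labels. It is not the authors' code: the function name `cohens_kappa` and the ten toy labels are assumptions made purely for illustration and are not drawn from the data set.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Hypothetical helper for illustration; not taken from the article.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (invented for this example): the annotators disagree on one of ten tweets.
annotator_1 = ["outcome", "outcome", "non-outcome", "non-outcome", "outcome",
               "non-outcome", "non-outcome", "non-outcome", "outcome", "non-outcome"]
annotator_2 = ["outcome", "outcome", "non-outcome", "non-outcome", "non-outcome",
               "non-outcome", "non-outcome", "non-outcome", "outcome", "non-outcome"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"{kappa:.2f}")  # 0.78
```

In this toy case the raw agreement is 0.90, but kappa is lower (about 0.78) because part of that agreement would be expected by chance given how often each annotator uses each label.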