Discovering Cohorts of Pregnant Women From Social Media for Safety Surveillance and Analysis.

Author information

Sarker Abeed, Chandrashekar Pramod, Magge Arjun, Cai Haitao, Klein Ari, Gonzalez Graciela

Affiliations

Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.

Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, United States.

Publication information

J Med Internet Res. 2017 Oct 30;19(10):e361. doi: 10.2196/jmir.8164.

Abstract

BACKGROUND

Pregnancy exposure registries are the primary source of information about the safety of medication use during pregnancy. Such registries enroll pregnant women voluntarily early in pregnancy and follow them until the end of pregnancy or longer to systematically collect information on specific pregnancy outcomes. Although pregnancy registries have distinct advantages over other study designs, they face numerous challenges and limitations, such as low enrollment rates, high cost, and selection bias.

OBJECTIVE

The primary objectives of this study were to systematically assess whether social media (Twitter) can be used to discover cohorts of pregnant women and to develop and deploy a natural language processing and machine learning pipeline for the automatic collection of cohort information. We also attempted to ascertain, in a preliminary fashion, what types of longitudinal information could be mined from the collected cohort information.

METHODS

Our discovery of pregnant women relies on detecting pregnancy-indicating tweets (PITs), which are statements posted by pregnant women regarding their pregnancies. We used a set of 14 patterns to first detect potential PITs. We manually annotated a sample of 14,156 of the retrieved user posts to distinguish real PITs from false positives and trained a supervised classification system to detect real PITs. We optimized the classification system via cross validation, with features and settings targeted toward optimizing precision for the positive class. For users identified to be posting real PITs via automatic classification, our pipeline collected all their available past and future posts from which other information (eg, medication usage and fetal outcomes) may be mined.
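The first stage described above, pattern-based retrieval of candidate PITs, might look roughly like the following sketch. The abstract does not list the 14 patterns, so the three regular expressions and the function names here are hypothetical stand-ins for illustration only. As in the study, such patterns are expected to produce false positives, which is why a supervised classifier follows this step.

```python
import re

# Hypothetical patterns for illustration only; the study's actual
# 14 patterns are not given in the abstract.
CANDIDATE_PATTERNS = [
    re.compile(r"\bi(?:'m| am)\s+(?:\d+\s+(?:weeks?|months?)\s+)?pregnant\b", re.IGNORECASE),
    re.compile(r"\bmy\s+(?:pregnancy|baby\s+bump|due\s+date)\b", re.IGNORECASE),
    re.compile(r"\bi(?:'m| am)\s+expecting\b", re.IGNORECASE),
]


def is_candidate_pit(text: str) -> bool:
    """Return True if the post matches any candidate pattern."""
    return any(pattern.search(text) for pattern in CANDIDATE_PATTERNS)


def filter_candidates(posts):
    """Keep only posts that look like potential pregnancy-indicating tweets."""
    return [post for post in posts if is_candidate_pit(post)]


posts = [
    "I'm 12 weeks pregnant and so tired",
    "My sister is pregnant",                 # about someone else: not retrieved
    "lol I am pregnant with a food baby",    # retrieved, but a false positive
]
print(filter_candidates(posts))
```

The third post shows why a high-recall pattern filter alone is not enough: it matches the surface form of a PIT without being a real pregnancy announcement.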

RESULTS

Our rule-based PIT detection approach retrieved over 200,000 posts over a period of 18 months. Manual annotation agreement for three annotators was very high at kappa (κ)=.79. On a blind test set, the implemented classifier obtained an overall F score of 0.84 (0.88 for the pregnancy class and 0.68 for the nonpregnancy class). Precision for the pregnancy class was 0.93, and recall was 0.84. Feature analysis showed that the combination of dense and sparse vectors for classification achieved optimal performance. Employing the trained classifier resulted in the identification of 71,954 users from the collected posts. Over 250 million posts were retrieved for these users, which provided a multitude of longitudinal information about them.
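As a quick check on the reported pregnancy-class figures, the sketch below recomputes the F score from the stated precision and recall. It assumes the reported F score is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (assumed metric)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Pregnancy-class precision and recall as reported in the abstract.
pregnancy_f = f_score(0.93, 0.84)
print(round(pregnancy_f, 2))  # 0.88
```

Under that assumption, the result agrees with the 0.88 reported for the pregnancy class.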

CONCLUSIONS

Social media sources such as Twitter can be used to identify large cohorts of pregnant women and to gather longitudinal information via automated processing of their postings. Considering the many drawbacks and limitations of pregnancy registries, social media mining may provide beneficial complementary information. Although the cohort sizes identified over social media are large, future research will have to assess the completeness of the information available through them.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4537/5684515/ec74258c1548/jmir_v19i10e361_fig1.jpg
