
Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter.

Affiliation

Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.

Publication information

J Biomed Inform. 2018 Nov;87:68-78. doi: 10.1016/j.jbi.2018.10.001. Epub 2018 Oct 4.

DOI: 10.1016/j.jbi.2018.10.001
PMID: 30292855
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6295660/
Abstract

BACKGROUND

Although birth defects are the leading cause of infant mortality in the United States, methods for observing human pregnancies with birth defect outcomes are limited.

OBJECTIVE

The primary objectives of this study were (i) to assess whether rare health-related events (in this case, birth defects) are reported on social media, (ii) to design and deploy a natural language processing (NLP) approach for collecting such sparse data from social media, and (iii) to utilize the collected data to discover a cohort of women whose pregnancies with birth defect outcomes could be observed on social media for epidemiological analysis.

METHODS

To assess whether birth defects are mentioned on social media, we mined 432 million tweets posted by 112,647 users who were automatically identified via their public announcements of pregnancy on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach, which relies on a lexicon, lexical variants generated from the lexicon entries, regular expressions, post-processing, and manual analysis guided by distributional properties. To identify users whose pregnancies with birth defect outcomes could be observed for epidemiological analysis, we applied two inclusion criteria: (i) the user's tweets indicate that the user's child has a birth defect, and (ii) the user's tweets posted during pregnancy are accessible. We conducted a semi-automatic evaluation to estimate the recall of the tweet-collection approach, and performed a preliminary assessment of the prevalence of selected birth defects among the pregnancy cohort derived from Twitter.
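As a rough illustration of how a lexicon, generated lexical variants, regular expressions, and a post-processing rule can fit together, the following Python sketch may help. It is not the authors' implementation: the three lexicon entries, the single-letter-deletion variant strategy, the retweet filter, and the names `LEXICON`, `deletion_variants`, `build_pattern`, and `retrieve` are all assumptions made for this example, since the abstract does not specify them.

```python
import re

# Hypothetical mini-lexicon; the study's actual lexicon is not given in the abstract.
LEXICON = ["cleft palate", "spina bifida", "congenital heart defect"]


def deletion_variants(term):
    """Return the term plus misspellings made by deleting one letter.

    Single-letter deletion is an assumed variant strategy, chosen only
    to keep the sketch small.
    """
    variants = {term}
    for i, ch in enumerate(term):
        if ch.isalpha():
            variants.add(term[:i] + term[i + 1:])
    return variants


def build_pattern(lexicon):
    """Compile one case-insensitive regex covering every entry and variant."""
    alternatives = set()
    for term in lexicon:
        for variant in deletion_variants(term):
            # Allow whitespace or hyphens between the words of a term.
            words = [re.escape(word) for word in variant.split()]
            alternatives.add(r"[\s-]+".join(words))
    # Put longer alternatives first so they are tried before shorter ones.
    ordered = sorted(alternatives, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(ordered) + r")\b", re.IGNORECASE)


def retrieve(tweets, pattern):
    """Return (tweet, matched_text) pairs for tweets that mention a term.

    Skipping retweets is an assumed post-processing rule.
    """
    hits = []
    for tweet in tweets:
        if tweet.startswith("RT @"):
            continue
        match = pattern.search(tweet)
        if match:
            hits.append((tweet, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = [
        "My son was born with a clft palate",
        "RT @someone: spina bifida awareness month",
        "Lovely day at the park",
        "Our baby has Spina-Bifida",
    ]
    for tweet, matched in retrieve(sample, build_pattern(LEXICON)):
        print(matched, "<-", tweet)
```

In a bootstrapping setting, terms found during manual review of the retrieved tweets would be added to the lexicon and the pattern rebuilt, so the retrieval step is run repeatedly rather than once.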

RESULTS

We manually annotated 16,822 retrieved tweets, distinguishing tweets indicating that the user's child has a birth defect (true positives) from tweets that merely mention birth defects (false positives). Inter-annotator agreement was substantial: κ = 0.79 (Cohen's kappa). Analyzing the timelines of the 646 users whose tweets were true positives resulted in the discovery of 195 users who met the inclusion criteria. Congenital heart defects are the most common type of birth defect reported on Twitter, consistent with findings in the general population. Based on an evaluation of 4,169 tweets retrieved using alternative text mining methods, the recall of the tweet-collection approach was 0.95.
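For readers unfamiliar with the two metrics reported here, the sketch below shows how Cohen's kappa and recall are computed in general. The function names `cohens_kappa` and `recall` and all input values are made up for illustration; they are not the study's annotations or counts.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def recall(true_positives, false_negatives):
    """Fraction of relevant items that the retrieval approach found."""
    return true_positives / (true_positives + false_negatives)


if __name__ == "__main__":
    # Toy labels: 1 = "user's child has a birth defect", 0 = "mere mention".
    annotator_1 = [1, 1, 0, 0]
    annotator_2 = [1, 0, 0, 0]
    print(cohens_kappa(annotator_1, annotator_2))
    # Hypothetical counts, not taken from the study.
    print(recall(true_positives=95, false_negatives=5))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (here, false positives) dominates the annotated set.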

CONCLUSIONS

Our contributions include (i) evidence that rare health-related events are indeed reported on Twitter, (ii) a generalizable, systematic NLP approach for collecting sparse tweets, (iii) a semi-automatic method to identify undetected tweets (false negatives), and (iv) a collection of publicly available tweets by pregnant users with birth defect outcomes, which could be used for future epidemiological analysis. In future work, the annotated tweets could be used to train machine learning algorithms to automatically identify users reporting birth defect outcomes, enabling the large-scale use of social media mining as a complementary method for such epidemiological research.

