Predicting Emerging Themes in Rapidly Expanding COVID-19 Literature With Unsupervised Word Embeddings and Machine Learning: Evidence-Based Study.

Author affiliations

Department of Computational Biology, Indraprastha Institute of Information Technology Delhi, New Delhi, India.

Maharaja Surajmal Institute of Technology, Guru Gobind Singh Indraprastha University, New Delhi, India.

Publication information

J Med Internet Res. 2022 Nov 2;24(11):e34067. doi: 10.2196/34067.

Abstract

BACKGROUND

Evidence from peer-reviewed literature is the cornerstone for designing responses to global threats such as COVID-19. In massive and rapidly growing corpora, such as COVID-19 publications, assimilating and synthesizing information is challenging. A robust computational pipeline that evaluates multiple aspects, such as network topological features, communities, and their temporal trends, can make this process more efficient.

OBJECTIVE

We aimed to show that new knowledge can be captured and tracked using the temporal change in the underlying unsupervised word embeddings of the literature. Furthermore, imminent themes can be predicted by applying machine learning to the evolving associations between words.

METHODS

Frequently occurring medical entities were extracted from the abstracts of more than 150,000 COVID-19 articles published in the World Health Organization database, collected at monthly intervals starting in February 2020. Word embeddings trained on each month's literature were used to construct networks of entities, with cosine similarities as edge weights. Topological features of the subsequent month's network were forecast based on prior patterns, and new links were predicted using supervised machine learning. Community detection and alluvial diagrams were used to track biomedical themes that evolved over the months.
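As an illustration of the network construction step, the following minimal sketch builds a weighted entity network from word embeddings, using cosine similarity as the edge weight. The toy three-dimensional vectors, the entity names, the `threshold` parameter, and the function names are all assumptions made for this example; the abstract does not state the embedding dimension, the training procedure, or whether edges were filtered by a cutoff.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def build_entity_network(embeddings, threshold=0.5):
    """Return {(entity_a, entity_b): weight} for pairs at or above threshold.

    `threshold` is a hypothetical cutoff used only to keep the toy graph sparse.
    """
    edges = {}
    for a, b in combinations(sorted(embeddings), 2):
        weight = cosine_similarity(embeddings[a], embeddings[b])
        if weight >= threshold:
            edges[(a, b)] = weight
    return edges


# Hypothetical embeddings for one month's literature (not real data).
toy_embeddings = {
    "thrombosis": [0.9, 0.1, 0.0],
    "embolism": [0.8, 0.2, 0.1],
    "anosmia": [0.0, 0.1, 0.9],
}

network = build_entity_network(toy_embeddings, threshold=0.5)
for (a, b), weight in network.items():
    print(f"{a} -- {b}: {weight:.3f}")
```

In this toy example only the two clotting-related entities end up connected. Repeating the construction on each month's embeddings would give a sequence of networks whose changes can then be tracked over time.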

RESULTS

We found that thromboembolic complications were detected as an emerging theme as early as August 2020. A shift toward the symptoms of long COVID complications was observed during March 2021, and neurological complications gained significance in June 2021. A prospective validation of the link prediction models achieved an area under the receiver operating characteristic curve of 0.87. Predictive modeling revealed predisposing conditions, symptoms, cross-infection, and neurological complications as dominant research themes in COVID-19 publications based on the patterns observed in previous months.
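For readers unfamiliar with the reported metric, the sketch below computes the area under the receiver operating characteristic curve (AUROC) for a link prediction task as the fraction of (true link, non-link) pairs in which the true link receives the higher score, with ties counted as half. The labels and scores are made-up values for illustration only, and the function name is hypothetical; this is not the authors' evaluation code.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical candidate links: 1 = link appeared the following month.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.3, 0.7, 0.6, 0.4, 0.2]

print(f"AUROC = {auroc(labels, scores):.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of future links from non-links.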

CONCLUSIONS

Machine learning-based prediction of emerging links can contribute toward steering research by capturing themes represented by groups of medical entities, based on patterns of semantic relationships over time.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ac9e/9629347/a3c11d075080/jmir_v24i11e34067_fig1.jpg
