Predicting Emerging Themes in Rapidly Expanding COVID-19 Literature With Unsupervised Word Embeddings and Machine Learning: Evidence-Based Study.

Author information

Department of Computational Biology, Indraprastha Institute of Information Technology Delhi, New Delhi, India.

Maharaja Surajmal Institute of Technology, Guru Gobind Singh Indraprastha University, New Delhi, India.

Publication information

J Med Internet Res. 2022 Nov 2;24(11):e34067. doi: 10.2196/34067.

DOI:10.2196/34067

PMID:36040993

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9629347/

Abstract

BACKGROUND

Evidence from peer-reviewed literature is the cornerstone for designing responses to global threats such as COVID-19. In massive and rapidly growing corpora, such as the body of COVID-19 publications, assimilating and synthesizing information is challenging. A robust computational pipeline that evaluates multiple aspects of the literature, such as network topological features, communities, and their temporal trends, can make this process more efficient.

OBJECTIVE

We aimed to show that new knowledge can be captured and tracked using the temporal change in the underlying unsupervised word embeddings of the literature. Furthermore, imminent themes can be predicted by applying machine learning to the evolving associations between words.

METHODS

Frequently occurring medical entities were extracted from the abstracts of more than 150,000 COVID-19 articles indexed in the World Health Organization database, collected at monthly intervals starting from February 2020. Word embeddings trained on each month's literature were used to construct networks of entities with cosine similarities as edge weights. Topological features of the subsequent month's network were forecast from prior patterns, and new links were predicted using supervised machine learning. Community detection and alluvial diagrams were used to track biomedical themes as they evolved over the months.
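The network construction step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the entity names, the toy 3-dimensional vectors, the function names, and the similarity threshold of 0.5 are all assumptions made for illustration, and the abstract does not say whether any threshold was applied. The sketch only shows how cosine similarities between entity embeddings can serve as edge weights.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def build_entity_network(embeddings, threshold=0.5):
    """Return {(entity_a, entity_b): weight} for pairs at or above threshold.

    `threshold` is a hypothetical pruning parameter, not taken from the paper.
    """
    edges = {}
    for a, b in combinations(sorted(embeddings), 2):
        weight = cosine_similarity(embeddings[a], embeddings[b])
        if weight >= threshold:
            edges[(a, b)] = weight
    return edges


# Made-up vectors standing in for one month's trained word embeddings.
month_embeddings = {
    "thrombosis": [1.0, 0.0, 0.0],
    "anticoagulant": [1.0, 1.0, 0.0],
    "anosmia": [0.0, 0.0, 1.0],
}

network = build_entity_network(month_embeddings)
for (a, b), weight in network.items():
    print(f"{a} -- {b}: {weight:.3f}")
```

Repeating this for each month's embeddings would give one weighted network per month, which is the kind of time series that the forecasting and link prediction steps operate on.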

RESULTS

Thromboembolic complications were detected as an emerging theme as early as August 2020. A shift toward the symptoms of long COVID complications was observed during March 2021, and neurological complications gained significance in June 2021. Prospective validation of the link prediction models yielded an area under the receiver operating characteristic curve of 0.87. Based on the patterns observed in previous months, predictive modeling identified predisposing conditions, symptoms, cross-infection, and neurological complications as dominant research themes in COVID-19 publications.
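For readers unfamiliar with the metric reported above, the following sketch shows one standard way to compute the area under the receiver operating characteristic curve for a link predictor: the fraction of (true link, non-link) pairs in which the true link receives the higher score, with ties counted as half. The labels and scores are invented for illustration and are unrelated to the study's data; the function name `auroc` is likewise an assumption.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative.

    labels: 1 for a link that did appear, 0 for one that did not.
    scores: predicted link scores, higher meaning more likely.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Invented candidate links: two that materialized, three that did not.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
result = auroc(labels, scores)
print(f"AUROC = {result:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported 0.87 means a true future link outranked a non-link in roughly 87% of such pairs.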

CONCLUSIONS

Machine learning-based prediction of emerging links can contribute toward steering research by capturing themes represented by groups of medical entities, based on patterns of semantic relationships over time.

