Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota, USA.
Division of Digital Health Sciences, Mayo Clinic, Rochester, Minnesota, USA.
J Am Med Inform Assoc. 2020 Aug 1;27(8):1259-1267. doi: 10.1093/jamia/ocaa117.
As coronavirus disease 2019 (COVID-19) emerged rapidly and grew into an unprecedented pandemic, a knowledge repository for the disease became crucial. To address this need, a new machine-readable dataset, the COVID-19 Open Research Dataset (CORD-19), has been released. Building on this dataset, our objective was to construct computable co-occurrence network embeddings to assist association detection among COVID-19-related biomedical entities.
Leveraging a Linked Data version of CORD-19 (ie, CORD-19-on-FHIR), we first used SPARQL to extract co-occurrences among chemicals, diseases, genes, and mutations and built a co-occurrence network. We then learned a representation of the derived co-occurrence network using node2vec with 4 edge embedding operations (L1, L2, Average, and Hadamard). Six algorithms (decision tree, logistic regression, support vector machine, random forest, naïve Bayes, and multilayer perceptron) were applied to evaluate performance on link prediction. An unsupervised learning strategy incorporating the t-SNE (t-distributed stochastic neighbor embedding) and DBSCAN (density-based spatial clustering of applications with noise) algorithms was also developed for case studies.
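The four edge embedding operations named above can be illustrated with a minimal sketch. It assumes the element-wise operator definitions commonly used with node2vec (Average, Hadamard, and the absolute and squared differences for L1 and L2); the abstract does not spell out the exact formulas, and the node labels, vector values, and function names below are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


def average(u: Vector, v: Vector) -> Vector:
    """Element-wise mean of the two node embeddings."""
    return [(a + b) / 2.0 for a, b in zip(u, v)]


def hadamard(u: Vector, v: Vector) -> Vector:
    """Element-wise product of the two node embeddings."""
    return [a * b for a, b in zip(u, v)]


def l1(u: Vector, v: Vector) -> Vector:
    """Element-wise absolute difference."""
    return [abs(a - b) for a, b in zip(u, v)]


def l2(u: Vector, v: Vector) -> Vector:
    """Element-wise squared difference."""
    return [(a - b) ** 2 for a, b in zip(u, v)]


EDGE_OPERATORS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "average": average,
    "hadamard": hadamard,
    "l1": l1,
    "l2": l2,
}


def edge_features(embeddings: Dict[str, Vector], u: str, v: str, operator: str) -> Vector:
    """Combine two node embeddings into one edge feature vector."""
    return EDGE_OPERATORS[operator](embeddings[u], embeddings[v])


# Hypothetical 3-dimensional node embeddings (real node2vec vectors are much longer).
toy_embeddings = {
    "disease_A": [0.5, -1.0, 2.0],
    "chemical_B": [1.5, 1.0, 0.0],
}

for name in EDGE_OPERATORS:
    print(name, edge_features(toy_embeddings, "disease_A", "chemical_B", name))
```

Each operator yields a vector with the same dimension as the node embeddings, so the resulting edge features can be passed to any of the six classifiers as a fixed-length input for link prediction.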
The random forest classifier showed the best performance on link prediction across the different network embeddings. For edge embeddings generated using the Average operation, random forest achieved the highest average precision of 0.97, along with an F1 score of 0.90. For unsupervised learning, 63 clusters were formed with a silhouette score of 0.128. Significant associations were detected for 5 coronavirus infectious diseases in their corresponding subgroups.
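For readers unfamiliar with the two link prediction metrics reported above, the following sketch shows one standard way to compute them for binary labels. It assumes the usual definitions (F1 as the harmonic mean of precision and recall; average precision as the mean of the precision values at each correctly ranked positive); the abstract does not state which implementation was used, and the labels, scores, and 0.5 threshold below are made up for illustration.

```python
from typing import List


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision(y_true: List[int], scores: List[float]) -> float:
    """Mean of precision@k over the ranks k at which a true link appears."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Hypothetical candidate edges: 1 = true link, 0 = non-link, with classifier scores.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

print("F1:", f1_score(y_true, y_pred))
print("Average precision:", average_precision(y_true, scores))
```

Average precision depends only on how the classifier ranks candidate edges, whereas F1 depends on the chosen decision threshold, which is why the two values can differ substantially for the same model.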
In this study, we constructed COVID-19-centered co-occurrence network embeddings. Results indicated that the generated embeddings were able to extract significant associations for COVID-19 and coronavirus infectious diseases.