Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.
Department of Information Management, Peking University, Beijing 100871, China.
Genes (Basel). 2021 Jun 29;12(7):998. doi: 10.3390/genes12070998.
This study builds a coronavirus knowledge graph (KG) by merging two information sources. The first source is Analytical Graph (AG), which integrates more than 20 different public datasets related to drug discovery. The second source is CORD-19, a collection of published scientific articles related to COVID-19. We combined chemogenomic entities in AG with entities extracted from CORD-19 to expand knowledge in the COVID-19 domain. Before populating the KG with those entities, we performed entity disambiguation on the CORD-19 collection using Wikidata. Our newly built KG contains at least 21,700 genes, 2,500 diseases, 94,000 phenotypes, and other biological entities (e.g., compounds, species, and cell lines). We defined 27 relationship types and used them to label each edge in the KG. We present two cases to evaluate the KG's usability: analyzing a subgraph (ego-centered network) around the angiotensin-converting enzyme (ACE), and revealing paths between pairs of biological entities (hydroxychloroquine and IL-6 receptor; chloroquine and STAT1). The ego-centered network captured information related to COVID-19. Our path evaluation also found significant COVID-19-related information in the top-ranked paths of depth three.
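The two usability cases can be illustrated with a minimal sketch, assuming the KG is available as a list of labeled (head, relation, tail) triples. Everything in the toy data below is hypothetical: the intermediate nodes (`gene_A`, `disease_B`, `gene_C`), the relation labels (`rel_1` to `rel_5`), and the edges themselves are placeholders and are not taken from the KG described above. The function names are also illustrative. The sketch extracts a radius-one ego-centered network and enumerates simple paths of at most three edges; it does not reproduce the path ranking, whose scoring method is not specified here.

```python
from collections import defaultdict

# Hypothetical toy triples (head, relation, tail).
# These edges are placeholders, NOT facts from the actual KG.
TRIPLES = [
    ("hydroxychloroquine", "rel_1", "gene_A"),
    ("gene_A", "rel_2", "IL-6 receptor"),
    ("hydroxychloroquine", "rel_3", "disease_B"),
    ("disease_B", "rel_4", "gene_C"),
    ("gene_C", "rel_2", "IL-6 receptor"),
    ("ACE", "rel_5", "disease_B"),
    ("ACE", "rel_1", "gene_C"),
]


def build_adjacency(triples):
    """Index triples by node, treating every edge as undirected."""
    adj = defaultdict(list)
    for head, rel, tail in triples:
        adj[head].append((rel, tail))
        adj[tail].append((rel, head))
    return adj


def ego_network(triples, center):
    """Return the triples among `center` and its direct neighbours."""
    members = {center}
    for head, _, tail in triples:
        if head == center:
            members.add(tail)
        elif tail == center:
            members.add(head)
    return [t for t in triples if t[0] in members and t[2] in members]


def paths_up_to_depth(adj, source, target, max_depth=3):
    """Enumerate simple paths from source to target with <= max_depth edges."""
    results = []

    def dfs(node, path, visited):
        if node == target:
            results.append(list(path))
            return
        if len(path) == max_depth:
            return
        for rel, nxt in adj[node]:
            if nxt not in visited:
                visited.add(nxt)
                path.append((node, rel, nxt))
                dfs(nxt, path, visited)
                path.pop()
                visited.remove(nxt)

    dfs(source, [], {source})
    return results


if __name__ == "__main__":
    adjacency = build_adjacency(TRIPLES)
    for triple in ego_network(TRIPLES, "ACE"):
        print("ego edge:", triple)
    for path in paths_up_to_depth(adjacency, "hydroxychloroquine", "IL-6 receptor"):
        print("path:", path)
```

On a graph of the size reported above, exhaustive enumeration may be expensive even at depth three, so a graph database or a library such as NetworkX would likely be a more practical choice than this plain-Python version.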