Micelio, Antwerpen, Belgium.
Department of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, Maastricht, The Netherlands.
BMC Biol. 2021 Jan 22;19(1):12. doi: 10.1186/s12915-020-00940-y.
Pandemics, even more than other medical problems, require swift integration of knowledge. When a pandemic is caused by a new virus, understanding the underlying biology may help in finding solutions. In a setting with a large number of loosely related projects and initiatives, we need common ground, also known as a "commons." Wikidata, a public knowledge graph aligned with Wikipedia, is such a commons and uses unique identifiers to link knowledge in other knowledge bases. However, Wikidata may not always have the right schema for the urgent questions. In this paper, we address this problem by showing how a data schema required for the integration can be modeled with entity schemas represented by Shape Expressions.
As a telling example, we describe the process of aligning resources on the genomes and proteomes of the SARS-CoV-2 virus and related viruses, as well as how Shape Expressions can be defined for Wikidata to model this knowledge, helping others who study the SARS-CoV-2 pandemic. We demonstrate how this model makes data from various resources interoperable by integrating data from NCBI (National Center for Biotechnology Information) Taxonomy, NCBI Genes, UniProt, and WikiPathways. Based on that model, a set of automated applications, or bots, was written to update Wikidata regularly from these sources and added to a platform that runs these updates automatically.
Although this workflow was developed and applied in the context of the COVID-19 pandemic, it was also applied to other human coronaviruses (MERS, SARS, human coronavirus NL63, human coronavirus 229E, human coronavirus HKU1, human coronavirus OC43) to demonstrate its broader applicability.
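The idea of checking items against a schema before a bot writes them can be illustrated with a minimal Python sketch. It is not the authors' implementation and it is not Shape Expressions syntax: the `Constraint` class, the `validate` function, and the three example shapes are hypothetical simplifications that only check how many values an item has for each property. The property identifiers (P685 for NCBI Taxonomy ID, P351 for Entrez Gene ID, P352 for UniProt protein ID, P703 for "found in taxon") are Wikidata's; the cardinalities chosen here are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Wikidata property identifiers used to link out to the source databases.
NCBI_TAXONOMY_ID = "P685"
ENTREZ_GENE_ID = "P351"
UNIPROT_PROTEIN_ID = "P352"
FOUND_IN_TAXON = "P703"


@dataclass(frozen=True)
class Constraint:
    """Hypothetical cardinality constraint on one property (not real ShEx)."""
    prop: str
    min_count: int = 1
    max_count: Optional[int] = 1  # None means no upper bound


# Assumed, simplified shapes: which properties each kind of item must carry.
VIRUS_SHAPE = [Constraint(NCBI_TAXONOMY_ID)]
GENE_SHAPE = [Constraint(ENTREZ_GENE_ID), Constraint(FOUND_IN_TAXON)]
PROTEIN_SHAPE = [
    Constraint(UNIPROT_PROTEIN_ID),
    Constraint(FOUND_IN_TAXON, max_count=None),
]


def validate(item: Dict[str, List[str]], shape: List[Constraint]) -> List[str]:
    """Return a list of human-readable violations; empty means the item conforms."""
    problems = []
    for c in shape:
        n = len(item.get(c.prop, []))
        if n < c.min_count:
            problems.append(f"{c.prop}: expected at least {c.min_count} value(s), found {n}")
        elif c.max_count is not None and n > c.max_count:
            problems.append(f"{c.prop}: expected at most {c.max_count} value(s), found {n}")
    return problems


# 2697049 is the NCBI Taxonomy ID of SARS-CoV-2.
sars_cov_2 = {NCBI_TAXONOMY_ID: ["2697049"]}
incomplete_gene = {ENTREZ_GENE_ID: ["<gene-id>"]}  # placeholder value, no taxon link

print(validate(sars_cov_2, VIRUS_SHAPE))
print(validate(incomplete_gene, GENE_SHAPE))
```

In a bot workflow of the kind described above, a check like this would run before each update so that items lacking the identifiers needed for cross-resource linking are flagged rather than written.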