TIB Leibniz Information Centre for Science and Technology, Welfengarten 1B, 30167, Hanover, Germany.
J Biomed Semantics. 2021 Nov 25;12(1):20. doi: 10.1186/s13326-021-00254-0.
The size, velocity, and heterogeneity of Big Data outclass conventional data management tools and require data and metadata to be fully machine-actionable (i.e., eScience-compliant) and thus findable, accessible, interoperable, and reusable (FAIR). This can be achieved by using ontologies and representing data and metadata as semantic graphs. Here, we discuss two different semantic graph approaches to representing empirical data and metadata in a knowledge graph, with phenotype descriptions as an example. Almost all phenotype descriptions are still published as unstructured natural language texts, with far-reaching consequences for their FAIRness that substantially impede their overall usability within the life sciences. However, with an increasing number of anatomy ontologies becoming available and semantic applications emerging, a solution to this problem is now within reach. Researchers are starting to document and communicate phenotype descriptions through the Web in the form of highly formalized and structured semantic graphs that use ontology terms and Uniform Resource Identifiers (URIs) to circumvent the problems connected with unstructured texts.
Using phenotype descriptions as an example, we compare and evaluate two basic representations of empirical data and their accompanying metadata in the form of semantic graphs: the class-based TBox semantic graph approach called Semantic Phenotype and the instance-based ABox semantic graph approach called Phenotype Knowledge Graph. Their main difference is that only the ABox approach allows for identifying every individual part and property mentioned in the description in a knowledge graph. This technical difference has substantial practical consequences for the overall usability of empirical data: it affects their findability, accessibility, and explorability, as well as their comparability, expandability, universal usability and reusability, and overall machine-actionability. Moreover, TBox semantic graphs often require querying under entailment regimes, which is computationally more complex.
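The difference in identifiability can be illustrated with a minimal sketch. It is not taken from the paper: triples are modeled as plain Python tuples rather than with an RDF library, and every identifier (the `ex:` names, the blank-node labels, and the helper functions) is a hypothetical placeholder. The ABox graph gives each part and quality its own named individual, while the TBox graph describes the same phenotype as one named class built from anonymous class expressions.

```python
RDF_TYPE = "rdf:type"

# Instance-based (ABox) sketch: every part and quality is a named individual.
# All identifiers are hypothetical.
abox = {
    ("ex:specimen42_head", RDF_TYPE, "ex:Head"),
    ("ex:specimen42_eye1", RDF_TYPE, "ex:Eye"),
    ("ex:specimen42_eye1", "ex:partOf", "ex:specimen42_head"),
    ("ex:specimen42_eye1_color", RDF_TYPE, "ex:Red"),
    ("ex:specimen42_eye1", "ex:hasQuality", "ex:specimen42_eye1_color"),
}

# Class-based (TBox) sketch: one named class defined by nested anonymous
# class expressions, written here as blank nodes ("_:b1", ...).
tbox = {
    ("ex:Phenotype42", "owl:equivalentClass", "_:b1"),
    ("_:b1", "owl:onProperty", "ex:hasPart"),
    ("_:b1", "owl:someValuesFrom", "_:b2"),
    ("_:b2", "owl:intersectionOf", ("ex:Eye", "_:b3")),
    ("_:b3", "owl:onProperty", "ex:hasQuality"),
    ("_:b3", "owl:someValuesFrom", "ex:Red"),
}


def is_blank(node):
    """Return True for blank-node labels, which carry no stable identifier."""
    return isinstance(node, str) and node.startswith("_:")


def named_instances_of(graph, cls):
    """Named subjects directly typed as `cls` (simple pattern match, no reasoning)."""
    return sorted(
        s for (s, p, o) in graph
        if p == RDF_TYPE and o == cls and not is_blank(s)
    )


def addressable_resources(graph):
    """All subjects that have their own identifier and can be referenced directly."""
    return sorted({s for (s, _, _) in graph if not is_blank(s)})


print(named_instances_of(abox, "ex:Eye"))
print(named_instances_of(tbox, "ex:Eye"))
print(addressable_resources(abox))
print(addressable_resources(tbox))
```

In this sketch, a plain pattern match finds the individual eye in the ABox graph, whereas the TBox graph exposes only the phenotype class itself as an addressable resource; recovering the parts it mentions would require walking the nested class expressions or querying with reasoning support.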
We conclude that, from a conceptual point of view, the advantages of the instance-based ABox semantic graph approach outweigh its shortcomings and outweigh the advantages of the class-based TBox semantic graph approach. Therefore, we recommend the instance-based ABox approach as a FAIR approach for documenting and communicating empirical data and metadata in a knowledge graph.