A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications.

Author information

Bolleman Jerven, Emonet Vincent, Altenhoff Adrian, Bairoch Amos, Blatter Marie-Claude, Bridge Alan, Duvaud Séverine, Gasteiger Elisabeth, Kuznetsov Dmitry, Moretti Sébastien, Michel Pierre-Andre, Morgat Anne, Pagni Marco, Redaschi Nicole, Zahn-Zabal Monique, de Farias Tarcisio Mendes, Sima Ana Claudia

Affiliation

SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.

Publication information

Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf045.

DOI:10.1093/gigascience/giaf045

PMID:40378136

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12083453/

Abstract

BACKGROUND

In recent decades, several life science resources have structured data using the same framework and made these accessible using the same query language to facilitate interoperability. Knowledge graphs have seen increased adoption in bioinformatics due to their advantages for representing data in a generic graph format. For example, yummydata.org catalogs more than 60 knowledge graphs accessible through SPARQL, a technical query language. Although SPARQL allows powerful, expressive queries, even across physically distributed knowledge graphs, formulating such queries is a challenge for most users. Therefore, to guide users in retrieving the relevant data, many of these resources provide representative examples. These examples can also be an important source of information for machine learning (for example, machine-learning algorithms for translating natural language questions to SPARQL), if a sufficiently large number of examples are provided and published in a common, machine-readable, and standardized format across different resources.
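To make the idea of querying across physically distributed knowledge graphs concrete, the following minimal sketch composes a federated query using the standard SPARQL 1.1 `SERVICE` keyword. The function name `federated_query`, the endpoint `https://example.org/sparql`, and the placeholder triple patterns are illustrative assumptions and do not come from the article.

```python
def federated_query(local_pattern: str, remote_endpoint: str,
                    remote_pattern: str, limit: int = 10) -> str:
    """Compose a SPARQL 1.1 SELECT query with one remote SERVICE block.

    The local pattern is evaluated by the endpoint that receives the query;
    the remote pattern is delegated to `remote_endpoint` via SERVICE.
    """
    return (
        "SELECT * WHERE {\n"
        f"  {local_pattern}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"    {remote_pattern}\n"
        "  }\n"
        f"}} LIMIT {limit}"
    )


# Hypothetical endpoint and generic triple patterns, for illustration only.
query = federated_query(
    local_pattern="?protein ?p ?gene .",
    remote_endpoint="https://example.org/sparql",
    remote_pattern="?gene ?q ?label .",
)
print(query)
```

The shared variable `?gene` is what joins the local and remote results; writing such joins by hand is the part that most users find difficult.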

FINDINGS

We introduce a large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs (KGs), collected over several years across different research groups at the SIB Swiss Institute of Bioinformatics. The collection comprises more than 1,000 example questions and queries, including almost 100 federated queries. We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards. Furthermore, we introduce an extensive set of open-source applications, including query graph visualizations and smart query editors, easily reusable by KG maintainers who adopt the proposed methodology.

CONCLUSIONS

We encourage the community to adopt and extend the proposed methodology, towards richer KG metadata and improved Semantic Web services. URL: https://github.com/sib-swiss/sparql-examples.
