BioFed: federated query processing over life sciences linked open data.

Authors

Hasnain Ali, Mehmood Qaiser, Sana E Zainab Syeda, Saleem Muhammad, Warren Claude, Zehra Durre, Decker Stefan, Rebholz-Schuhmann Dietrich

Affiliations

Insight Centre for Data Analytics, National University of Ireland (NUIG), Galway, Ireland.

Universität Leipzig, IFI/AKSW, Leipzig, PO 100920, D-04009, Germany.

Publication information

J Biomed Semantics. 2017 Mar 15;8(1):13. doi: 10.1186/s13326-017-0118-0.

DOI:10.1186/s13326-017-0118-0

PMID:28298238

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5353896/

Abstract

BACKGROUND

Biomedical data, e.g. from knowledge bases and ontologies, are increasingly made available following linked open data principles, ideally as RDF triples. This is a necessary step towards unified access to biological data sets, but it still requires solutions that query multiple endpoints for their heterogeneous data in order to retrieve all the meaningful information. The solutions suggested so far are based on query federation, which requires submitting SPARQL queries to endpoints. Because of the size and complexity of the available data, these solutions have to be optimised for efficient retrieval times and for users in life sciences research. In addition, the reliability of data resources, in terms of access and quality, has to be monitored over time. Our solution (BioFed) federates data over 130 SPARQL endpoints in the life sciences and tailors query submission according to provenance information. BioFed has been evaluated against the state-of-the-art solution FedX and forms an important benchmark for the life science domain.

METHODS

The core of the federated query processing system 'BioFed' consists of three components: an efficient cataloguing approach, triple-pattern-wise source selection, and semantic source normalisation. BioFed gathers and integrates data from newly identified public endpoints for federated access, and links basic provenance information to the retrieved data. It also makes use of the latest SPARQL standard (i.e., 1.1) to leverage the full benefits of query federation. The evaluation is based on 10 simple and 10 complex queries, which address data in 10 major and very popular data sources (e.g., DrugBank, SIDER).
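Triple-pattern-wise source selection can be illustrated with a minimal Python sketch. It is not BioFed's implementation: the catalogue layout, the endpoint URLs, the prefixed names, and the function `select_sources` are all assumptions made for illustration. The idea shown is only that each triple pattern is matched against a per-endpoint index of predicates and classes, so that a pattern is sent solely to endpoints that can contribute answers.

```python
from typing import Dict, List, Set, Tuple

# A triple pattern is (subject, predicate, object); variables start with "?".
TriplePattern = Tuple[str, str, str]

RDF_TYPE = "rdf:type"

# Hypothetical catalogue: endpoint URL -> indexed predicates and classes.
# The URLs and vocabulary terms are placeholders, not real BioFed entries.
CATALOGUE: Dict[str, Dict[str, Set[str]]] = {
    "http://example.org/drugbank/sparql": {
        "predicates": {"drugbank:target", "rdfs:label", RDF_TYPE},
        "classes": {"drugbank:Drug"},
    },
    "http://example.org/sider/sparql": {
        "predicates": {"sider:sideEffect", "rdfs:label", RDF_TYPE},
        "classes": {"sider:Drug"},
    },
}


def is_variable(term: str) -> bool:
    """Return True if the term is a SPARQL variable such as ?drug."""
    return term.startswith("?")


def select_sources(
    pattern: TriplePattern, catalogue: Dict[str, Dict[str, Set[str]]]
) -> List[str]:
    """Return the endpoints that may answer a single triple pattern."""
    _subject, predicate, obj = pattern
    selected = []
    for endpoint, index in catalogue.items():
        if is_variable(predicate):
            # Unbound predicate: the index cannot prune this endpoint.
            selected.append(endpoint)
        elif predicate == RDF_TYPE and not is_variable(obj):
            # Typed pattern: consult the class index.
            if obj in index["classes"]:
                selected.append(endpoint)
        elif predicate in index["predicates"]:
            # Bound predicate: consult the property index.
            selected.append(endpoint)
    return sorted(selected)
```

Under these assumptions, `select_sources(("?d", "drugbank:target", "?t"), CATALOGUE)` returns only the placeholder DrugBank endpoint, while a pattern on `rdfs:label` is routed to both endpoints.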

RESULTS

BioFed provides a single point of access to a large number of SPARQL endpoints that offer life science data. It facilitates efficient query generation for data access and provides basic provenance information in combination with the retrieved data. BioFed fully supports SPARQL 1.1 and gives access to endpoint availability information based on the EndpointData graph. Our evaluation of BioFed against FedX is based on 20 heterogeneous federated SPARQL queries and shows execution performance that is competitive with FedX, which can be attributed to the provision of provenance information for source selection.
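To make the combination of SPARQL 1.1 federation and provenance more concrete, the following Python sketch assembles a federated query in which each triple pattern is wrapped in a `SERVICE` clause for every selected endpoint, and the endpoint URL is bound to a variable that travels with the results. This is a sketch under stated assumptions, not BioFed's actual query rewriting: the function names, the `?sourceN` variables, and the plan structure are hypothetical. Only the `SERVICE`, `UNION` and `BIND` keywords come from the SPARQL 1.1 standard.

```python
from typing import List, Sequence, Tuple

# A triple pattern is (subject, predicate, object) in SPARQL syntax,
# e.g. ("?drug", "<http://example.org/target>", "?target").
TriplePattern = Tuple[str, str, str]


def service_block(pattern: TriplePattern, endpoint: str, source_var: str) -> str:
    """Wrap one pattern in a SERVICE clause and record its source endpoint."""
    s, p, o = pattern
    return (
        f"{{ SERVICE <{endpoint}> {{ {s} {p} {o} . }} "
        f"BIND(<{endpoint}> AS {source_var}) }}"
    )


def build_federated_query(
    plan: Sequence[Tuple[TriplePattern, Sequence[str]]]
) -> str:
    """Build a SPARQL 1.1 query from (pattern, selected endpoints) pairs.

    Endpoints selected for the same pattern are combined with UNION;
    different patterns are joined by juxtaposition.
    """
    groups: List[str] = []
    for i, (pattern, endpoints) in enumerate(plan):
        if not endpoints:
            raise ValueError(f"no source selected for pattern {pattern}")
        blocks = [service_block(pattern, ep, f"?source{i}") for ep in endpoints]
        groups.append("  " + "\n  UNION\n  ".join(blocks))
    return "SELECT * WHERE {\n" + "\n".join(groups) + "\n}"
```

Calling `build_federated_query` with one pattern and two endpoints yields a query with two `SERVICE` blocks joined by `UNION`, and each solution carries a `?source0` binding naming the endpoint that produced it.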

CONCLUSION

Developing and testing federated query engines for life sciences data is still a challenging task. According to our findings, it is advantageous to optimise the source selection. The cataloguing of SPARQL endpoints, including type and property indexing, leads to efficient querying of data resources over the Web of Data. This could even be further improved through the use of ontologies, e.g., for abstract normalisation of query terms.
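The type and property indexing mentioned above can be sketched as follows. This is an illustrative assumption about what such a catalogue might contain, not a description of BioFed's catalogue format: the function `build_catalogue`, the dictionary layout, and the idea of feeding it sampled triples are all hypothetical. In practice the distinct predicates and classes would more plausibly be collected by querying each endpoint directly.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_catalogue(
    samples: Dict[str, Iterable[Triple]]
) -> Dict[str, Dict[str, Set[str]]]:
    """Index the predicates and classes observed at each endpoint.

    `samples` maps an endpoint URL to triples retrieved from it.
    """
    catalogue: Dict[str, Dict[str, Set[str]]] = {}
    for endpoint, triples in samples.items():
        entry: Dict[str, Set[str]] = {"predicates": set(), "classes": set()}
        for _subject, predicate, obj in triples:
            # Property index: every predicate seen at this endpoint.
            entry["predicates"].add(predicate)
            # Type index: the object of each rdf:type statement is a class.
            if predicate == RDF_TYPE:
                entry["classes"].add(obj)
        catalogue[endpoint] = entry
    return catalogue
```

An index of this shape is small compared with the data it describes, which is why a lookup against it is a cheap way to decide where a query pattern should be sent.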

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8a09/5353896/7b6785543083/13326_2017_118_Fig1_HTML.jpg
