Hasnain Ali, Mehmood Qaiser, Sana E Zainab Syeda, Saleem Muhammad, Warren Claude, Zehra Durre, Decker Stefan, Rebholz-Schuhmann Dietrich
Insight Centre for Data Analytics, National University of Ireland (NUIG), Galway, Ireland.
Universität Leipzig, IFI/AKSW, Leipzig, PO 100920, D-04009, Germany.
J Biomed Semantics. 2017 Mar 15;8(1):13. doi: 10.1186/s13326-017-0118-0.
Biomedical data, e.g. from knowledge bases and ontologies, is increasingly made available following open linked data principles, ideally as RDF triple data. This is a necessary step towards unified access to biological data sets, but retrieving all the meaningful information still requires solutions that query multiple endpoints for their heterogeneous data. Suggested solutions are based on query federation approaches, which require the submission of SPARQL queries to endpoints. Due to the size and complexity of the available data, these solutions have to be optimised for efficient retrieval times and for users in life sciences research. In addition, the reliability of data resources in terms of access and quality has to be monitored over time. Our solution (BioFed) federates data over 130 SPARQL endpoints in life sciences and tailors query submission according to the provenance information. BioFed has been evaluated against the state-of-the-art solution FedX and forms an important benchmark for the life science domain.
The efficient cataloguing approach of the federated query processing system 'BioFed', the triple-pattern-wise source selection and the semantic source normalisation form the core of our solution. BioFed gathers and integrates data from newly identified public endpoints for federated access. Basic provenance information is linked to the retrieved data. BioFed also makes use of the latest SPARQL standard (i.e., 1.1) to leverage the full benefits for query federation. The evaluation is based on 10 simple and 10 complex queries, which address data in 10 major and very popular data sources (e.g., DrugBank, SIDER).
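Triple-pattern-wise source selection can be illustrated with a minimal sketch. It is not BioFed's implementation: the catalogue layout, the endpoint URLs and the `select_sources` function are hypothetical and serve only to show how a per-endpoint index of predicates and classes lets each triple pattern be routed to the endpoints that can answer it.

```python
from typing import Dict, List, Set, Tuple

# A triple pattern is (subject, predicate, object); variables start with "?".
TriplePattern = Tuple[str, str, str]

RDF_TYPE = "rdf:type"

# Hypothetical catalogue: endpoint URL -> predicates and classes it serves.
# The URLs and vocabulary terms below are made up for illustration.
CATALOGUE: Dict[str, Dict[str, Set[str]]] = {
    "http://example.org/drugbank/sparql": {
        "predicates": {"rdf:type", "drugbank:target", "rdfs:label"},
        "classes": {"drugbank:Drug"},
    },
    "http://example.org/sider/sparql": {
        "predicates": {"rdf:type", "sider:sideEffect", "rdfs:label"},
        "classes": {"sider:Drug"},
    },
}


def is_variable(term: str) -> bool:
    """Return True if the term is a SPARQL variable such as ?drug."""
    return term.startswith("?")


def select_sources(
    pattern: TriplePattern, catalogue: Dict[str, Dict[str, Set[str]]]
) -> List[str]:
    """Return the endpoints whose index matches one triple pattern."""
    _, predicate, obj = pattern
    selected = []
    for endpoint, index in catalogue.items():
        if is_variable(predicate):
            # Unbound predicate: the index cannot prune this endpoint.
            selected.append(endpoint)
        elif predicate == RDF_TYPE and not is_variable(obj):
            # Typed pattern: use the class index, which is more selective.
            if obj in index["classes"]:
                selected.append(endpoint)
        elif predicate in index["predicates"]:
            selected.append(endpoint)
    return sorted(selected)
```

Each pattern of a query is checked independently, so a query that joins drug targets with side effects would send one pattern to each endpoint instead of broadcasting the whole query to every source.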
BioFed provides a single point of access to a large number of SPARQL endpoints that serve life science data. It facilitates efficient query generation for data access and provides basic provenance information in combination with the retrieved data. BioFed fully supports SPARQL 1.1 and exposes endpoint availability through the EndpointData graph. Our evaluation of BioFed against FedX is based on 20 heterogeneous federated SPARQL queries and shows competitive execution performance, which can be attributed to the provision of provenance information for the source selection.
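To make the role of SPARQL 1.1 concrete, the following sketch rewrites triple patterns with known sources into a federated query that uses the standard `SERVICE` keyword. It is an illustration under stated assumptions, not BioFed's query rewriter: the function names, the `?sourceN` provenance variables and the example endpoint URLs are hypothetical. Each pattern becomes a `UNION` of `SERVICE` blocks, and a `BIND` records which endpoint produced each binding as a simple form of provenance.

```python
from typing import Dict, List, Sequence, Tuple

TriplePattern = Tuple[str, str, str]


def service_block(endpoint: str, pattern: TriplePattern, source_var: str) -> str:
    """Wrap one triple pattern in a SERVICE clause and tag its source."""
    s, p, o = pattern
    return (
        f"{{ SERVICE <{endpoint}> {{ {s} {p} {o} . }} "
        f"BIND(<{endpoint}> AS {source_var}) }}"
    )


def build_federated_query(
    assignments: Sequence[Tuple[TriplePattern, List[str]]],
    prefixes: Dict[str, str],
) -> str:
    """Build a SPARQL 1.1 query from (pattern, endpoints) assignments."""
    header = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    groups = []
    for i, (pattern, endpoints) in enumerate(assignments):
        source_var = f"?source{i}"  # hypothetical provenance variable
        blocks = [service_block(e, pattern, source_var) for e in endpoints]
        # Several candidate endpoints for one pattern are combined by UNION.
        groups.append("\n  UNION\n  ".join(blocks))
    body = "\n  ".join(groups)
    return "\n".join(header + ["SELECT * WHERE {", "  " + body, "}"])


if __name__ == "__main__":
    query = build_federated_query(
        [
            (("?drug", "ex:target", "?t"), ["http://example.org/a/sparql"]),
            (
                ("?drug", "ex:sideEffect", "?e"),
                ["http://example.org/a/sparql", "http://example.org/b/sparql"],
            ),
        ],
        {"ex": "http://example.org/vocab/"},
    )
    print(query)
```

The generated string can be submitted to any SPARQL 1.1 engine that supports federated queries; the engine, not this sketch, is responsible for contacting the remote endpoints.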
Developing and testing federated query engines for life sciences data is still a challenging task. According to our findings, it is advantageous to optimise the source selection. The cataloguing of SPARQL endpoints, including type and property indexing, leads to efficient querying of data resources over the Web of Data. This could be improved further through the use of ontologies, e.g., for abstract normalisation of query terms.
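Type and property indexing can be sketched as follows. This is a simplified illustration rather than the cataloguing procedure used by BioFed: the `index_endpoint` function and the sample triples are hypothetical, and the two query strings only indicate how a live endpoint might be asked for its distinct predicates and classes (they are not executed here).

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Queries a cataloguer could send to a live endpoint (shown, not executed).
DISTINCT_PREDICATES_QUERY = "SELECT DISTINCT ?p WHERE { ?s ?p ?o }"
DISTINCT_CLASSES_QUERY = "SELECT DISTINCT ?c WHERE { ?s a ?c }"


def index_endpoint(triples: Iterable[Triple]) -> Dict[str, Set[str]]:
    """Collect the distinct predicates and classes found in the triples."""
    predicates: Set[str] = set()
    classes: Set[str] = set()
    for _, predicate, obj in triples:
        predicates.add(predicate)
        if predicate == RDF_TYPE:
            # The object of an rdf:type statement is a class.
            classes.add(obj)
    return {"predicates": predicates, "classes": classes}


# Made-up sample data standing in for the content of one endpoint.
SAMPLE_TRIPLES = [
    ("http://example.org/drug/1", RDF_TYPE, "http://example.org/vocab/Drug"),
    ("http://example.org/drug/1", "http://example.org/vocab/target",
     "http://example.org/protein/7"),
    ("http://example.org/drug/2", RDF_TYPE, "http://example.org/vocab/Drug"),
]

if __name__ == "__main__":
    print(index_endpoint(SAMPLE_TRIPLES))
```

An index of this kind is small compared with the data it describes, which is why consulting it before query submission can avoid contacting endpoints that cannot contribute results.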