SAFE: SPARQL Federation over RDF Data Cubes with Access Control.

Author information

Khan Yasar, Saleem Muhammad, Mehdi Muntazir, Hogan Aidan, Mehmood Qaiser, Rebholz-Schuhmann Dietrich, Sahay Ratnesh

Affiliations

Insight Centre for Data Analytics, NUIG, Galway, Ireland.

AKSW, University of Leipzig, Leipzig, Germany.

Publication information

J Biomed Semantics. 2017 Feb 1;8(1):5. doi: 10.1186/s13326-017-0112-6.

DOI:10.1186/s13326-017-0112-6

PMID:28148277

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5288952/

Abstract

BACKGROUND

Several query federation engines have been proposed for accessing public Linked Open Data sources. However, in many domains, resources are sensitive and access to them is tightly controlled by stakeholders; consequently, privacy is a major concern when federating queries over such datasets. In the Healthcare and Life Sciences (HCLS) domain, real-world datasets contain sensitive statistical information, and strict ownership is granted to individuals working in hospitals, research labs, clinical trial organisers, etc. Therefore, the legal and ethical concerns of (i) preserving the anonymity of patients (or clinical subjects) and (ii) respecting data ownership through access control are key challenges faced by the data analytics community working within the HCLS domain. Likewise, statistical data play a key role in the domain, where the RDF Data Cube Vocabulary has been proposed as a standard format to enable the exchange of such data. However, to the best of our knowledge, no existing approach has looked to optimise federated queries over such statistical data.

RESULTS

We present SAFE: a query federation engine that enables policy-aware access to sensitive statistical datasets represented as RDF data cubes. SAFE is designed specifically to query statistical RDF data cubes in a distributed setting, where access control is coupled with source selection, user profiles and their access rights. SAFE proposes a join-aware source selection method that avoids wasteful requests to irrelevant and unauthorised data sources. In order to preserve anonymity and enforce stricter access control, SAFE's indexing system does not hold any data instances: it stores only predicates and endpoints. The resulting data summary has a significantly lower index generation time and size compared to existing engines, which allows for faster updates when sources change.
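As an illustration only, the idea of a predicate-and-endpoint index combined with access-aware, join-aware source selection might be sketched as below. This is not SAFE's actual implementation: the function names (`build_index`, `select_sources`), the tuple-based triple patterns, the use of named graphs as the unit of authorisation, and the rule that patterns sharing a subject variable must be answered by a common source are all assumptions made for the example. Triple patterns are assumed to have bound predicates.

```python
from collections import defaultdict


def build_index(summaries):
    """Build a hypothetical index: predicate -> set of (endpoint, graph) pairs.

    `summaries` is an iterable of (endpoint, graph, predicates) tuples.
    Only predicates and source locations are stored, never subjects or
    objects, so no data instances end up in the index.
    """
    index = defaultdict(set)
    for endpoint, graph, predicates in summaries:
        for predicate in predicates:
            index[predicate].add((endpoint, graph))
    return index


def select_sources(patterns, index, allowed_graphs):
    """Return one candidate source set per triple pattern.

    `patterns` is a list of (subject, predicate, object) tuples in which
    variables start with '?'. `allowed_graphs` is the set of named graphs
    the current user is authorised to read (an assumed policy model).
    """
    # Step 1: per-pattern candidates, dropping unauthorised graphs.
    candidates = []
    for _, predicate, _ in patterns:
        sources = {
            source
            for source in index.get(predicate, set())
            if source[1] in allowed_graphs
        }
        candidates.append(sources)

    # Step 2 (assumed join-aware pruning): patterns that share a subject
    # variable keep only the sources able to answer all of them.
    by_subject = defaultdict(list)
    for position, (subject, _, _) in enumerate(patterns):
        if subject.startswith("?"):
            by_subject[subject].append(position)
    for positions in by_subject.values():
        common = set.intersection(*(candidates[p] for p in positions))
        for p in positions:
            candidates[p] = common
    return candidates
```

In this sketch, an endpoint that holds only one of two joined predicates, or whose graph the user may not read, is pruned before any request is sent. That is the kind of saving the join-aware, policy-aware selection described above is meant to give.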

CONCLUSIONS

We validate the performance of the system with experiments over real-world datasets provided by three clinical organisations as well as legacy linked datasets. We show that SAFE enables granular graph-level access control over distributed clinical RDF data cubes and efficiently reduces the source selection and overall query execution time when compared with general-purpose SPARQL query federation engines in the targeted setting.
