Enabling ad-hoc reuse of private data repositories through schema extraction.

Affiliations

Informatik 5, RWTH Aachen University, Ahornstr. 55, Aachen, 52062, Germany.

Fraunhofer FIT, Schloss Birlinghoven, Sankt Augustin, 53754, Germany.

Publication information

J Biomed Semantics. 2020 Jul 8;11(1):6. doi: 10.1186/s13326-020-00223-z.

Abstract

BACKGROUND

Sharing sensitive data across organizational boundaries is often significantly limited by legal and ethical restrictions. Regulations such as the EU General Data Protection Regulation (GDPR) impose strict requirements concerning the protection of personal and privacy-sensitive data. Therefore, new approaches, such as the Personal Health Train initiative, are emerging to utilize data directly in their original repositories, circumventing the need to transfer data.

RESULTS

Circumventing limitations of previous systems, this paper proposes a configurable and automated schema extraction and publishing approach, which enables ad-hoc SPARQL query formulation against RDF triple stores without requiring direct access to the private data. The approach is compatible with existing Semantic Web-based technologies and allows for the subsequent execution of such queries in a safe setting under the data provider's control. Evaluation with four distinct datasets shows that a configurable amount of concise and task-relevant schema, closely describing the structure of the underlying data, was derived, enabling the schema introspection-assisted authoring of SPARQL queries.
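To illustrate the general idea of deriving a structural summary that exposes no instance values, here is a minimal sketch. It is not the extraction algorithm of the paper: the function name `extract_schema`, the prefixed names such as `ex:Patient`, and the in-memory list of triples are all assumptions made for illustration, and a real deployment would operate on an RDF triple store.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed prefixed name for the rdf:type predicate


def extract_schema(triples):
    """Summarize which predicates connect which classes.

    Returns {class: {predicate: sorted list of object classes}}.
    Objects without a known type (for example literals) are reported as
    "literal", so no instance value appears in the result.
    """
    # First pass: record the declared classes of every subject.
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    # Second pass: lift every non-type triple to the class level.
    schema = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        ranges = types.get(o) or {"literal"}
        for cls in types.get(s, ()):
            schema[cls][p].update(ranges)

    return {
        cls: {p: sorted(r) for p, r in preds.items()}
        for cls, preds in schema.items()
    }


# Hypothetical private data; only its structure should be published.
triples = [
    ("ex:p1", RDF_TYPE, "ex:Patient"),
    ("ex:p1", "ex:age", "54"),
    ("ex:p1", "ex:hasDiagnosis", "ex:d1"),
    ("ex:d1", RDF_TYPE, "ex:Diagnosis"),
    ("ex:d1", "ex:code", "C50.9"),
]

schema = extract_schema(triples)
for cls, preds in schema.items():
    print(cls, preds)
```

A query author who sees only this summary could learn that `ex:Patient` resources link to `ex:Diagnosis` resources via `ex:hasDiagnosis`, and could write a SPARQL query accordingly, without ever seeing the values `54` or `C50.9`.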

CONCLUSIONS

Automatically extracting and publishing data schema can enable the introspection-assisted creation of data selection and integration queries. In conjunction with the presented system architecture, this approach can enable reuse of data from private repositories and in settings where agreeing upon a shared schema and encoding a priori is infeasible. As such, it could provide an important step towards reuse of data from previously inaccessible sources and thus towards the proliferation of data-driven methods in the biomedical domain.
