Gauging triple stores with actual biological data.

Publication information

BMC Bioinformatics. 2012 Jan 25;13 Suppl 1(Suppl 1):S3. doi: 10.1186/1471-2105-13-S1-S3.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3471352/

Abstract

BACKGROUND

Semantic Web technologies have been developed to overcome the limitations of the current Web and conventional data integration solutions. The Semantic Web is expected to link all the data present on the Internet instead of linking just documents. One of the foundations of the Semantic Web technologies is the knowledge representation language Resource Description Framework (RDF). Knowledge expressed in RDF is typically stored in so-called triple stores (also known as RDF stores), from which it can be retrieved with SPARQL, a language designed for querying RDF-based models. The Semantic Web technologies should allow federated queries over multiple triple stores. In this paper we compare the efficiency of a set of biologically relevant queries as applied to a number of different triple store implementations.
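To make the data model concrete, the following is a minimal sketch in plain Python of what a triple store does conceptually: it holds (subject, predicate, object) statements and answers pattern queries in which unspecified positions act as variables. The class `TinyTripleStore`, the helper `proteins_in_process`, the `ex:` identifiers and the sample facts are all illustrative assumptions, not taken from the Cell Cycle Ontology. Real stores rely on indexes and a full SPARQL engine rather than a linear scan.

```python
from typing import Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class TinyTripleStore:
    """Toy in-memory triple store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        """Store one (subject, predicate, object) statement."""
        self._triples.add((s, p, o))

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None means 'any value'."""
        for t in self._triples:
            if (
                (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)
            ):
                yield t


def proteins_in_process(store: TinyTripleStore, process: str) -> List[str]:
    """Join two triple patterns on the shared subject.

    Roughly what this SPARQL query expresses (prefixes omitted):
        SELECT ?x WHERE { ?x rdf:type ex:Protein .
                          ?x ex:participates_in <process> }
    """
    proteins = {s for s, _, _ in store.match(p="rdf:type", o="ex:Protein")}
    participants = {s for s, _, _ in store.match(p="ex:participates_in", o=process)}
    return sorted(proteins & participants)


# Made-up sample facts; the identifiers are assumptions for this sketch.
store = TinyTripleStore()
store.add("ex:CDK1", "rdf:type", "ex:Protein")
store.add("ex:CDK1", "ex:participates_in", "ex:mitotic_cell_cycle")
store.add("ex:CCNB1", "rdf:type", "ex:Protein")
store.add("ex:CCNB1", "ex:participates_in", "ex:mitotic_cell_cycle")
store.add("ex:TP53", "rdf:type", "ex:Protein")

print(proteins_in_process(store, "ex:mitotic_cell_cycle"))
```

The join on a shared variable is the part that dominates query cost in practice, which is one reason the same query can behave very differently from one store implementation to another.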

RESULTS

Previously we developed a library of queries to guide the use of our knowledge base Cell Cycle Ontology implemented as a triple store. We have now compared the performance of these queries on five non-commercial triple stores: OpenLink Virtuoso (Open-Source Edition), Jena SDB, Jena TDB, SwiftOWLIM and 4Store. We examined three performance aspects: the data uploading time, the query execution time and the scalability. The queries we had chosen addressed diverse ontological or biological questions, and we found that individual store performance was quite query-specific. We identified three groups of queries displaying similar behaviour across the different stores: 1) relatively short response time queries, 2) moderate response time queries and 3) relatively long response time queries. SwiftOWLIM proved to be a winner in the first group, 4Store in the second one and Virtuoso in the third one.
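The measurements described above can be sketched as a small timing harness. This is only an illustration under stated assumptions: the function names, the number of repeats and the thresholds separating "short", "moderate" and "long" queries are hypothetical, since the abstract does not state the cut-offs used. Each query is represented by a zero-argument callable that would, in a real setup, send a SPARQL query to a store and wait for the result.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def benchmark_queries(
    queries: Dict[str, Callable[[], object]], repeats: int = 5
) -> Dict[str, float]:
    """Median execution time per query over `repeats` runs."""
    return {
        name: statistics.median([time_call(run) for _ in range(repeats)])
        for name, run in queries.items()
    }


def group_by_response_time(
    medians: Dict[str, float],
    short_max: float = 0.1,     # assumed threshold in seconds
    moderate_max: float = 1.0,  # assumed threshold in seconds
) -> Dict[str, List[str]]:
    """Bucket queries into three response-time groups."""
    groups: Dict[str, List[str]] = {"short": [], "moderate": [], "long": []}
    for name, seconds in sorted(medians.items()):
        if seconds < short_max:
            groups["short"].append(name)
        elif seconds < moderate_max:
            groups["moderate"].append(name)
        else:
            groups["long"].append(name)
    return groups


if __name__ == "__main__":
    # Stand-in workloads; a real run would call a triple store instead.
    demo_queries = {
        "q_count": lambda: sum(range(1_000)),
        "q_sort": lambda: sorted(range(100_000), reverse=True),
    }
    medians = benchmark_queries(demo_queries, repeats=3)
    print(group_by_response_time(medians))
```

Data uploading time can be measured with the same `time_call` helper wrapped around the load step, and scalability can be probed by repeating the whole procedure on datasets of increasing size and comparing how the medians grow.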

CONCLUSIONS

Our analysis showed that some queries behaved idiosyncratically, in a triple store specific manner, mainly with SwiftOWLIM and 4Store. Virtuoso, as expected, displayed a very balanced performance: its load time and its response time for all the tested queries were better than average among the selected stores, and it showed very good scalability and reasonable run-to-run reproducibility. Jena SDB and Jena TDB were consistently slower than the other three implementations. Our analysis demonstrated that most queries developed for Virtuoso could be successfully used with the other implementations.
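One simple way to put a number on run-to-run reproducibility is the coefficient of variation of repeated timings for the same query on the same store. The abstract does not say which statistic the authors used, so the metric and the function name `coefficient_of_variation` below are assumptions made purely for illustration.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(timings: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean (lower = more reproducible)."""
    if len(timings) < 2:
        raise ValueError("need at least two timings")
    mean = statistics.mean(timings)
    if mean == 0:
        raise ValueError("mean timing is zero")
    return statistics.stdev(timings) / mean


if __name__ == "__main__":
    # Made-up timings in seconds for two hypothetical stores.
    print(coefficient_of_variation([1.00, 1.02, 0.98]))
    print(coefficient_of_variation([1.0, 2.0, 3.0]))
```

A store whose timings for a given query vary little between runs yields a value close to zero, whereas a large value signals that a single measurement is not representative.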
