

Split4Blank: Maintaining consistency while improving efficiency of loading RDF data with blank nodes.

Affiliations

Database Center for Life Science (DBCLS), Research Organization of Information and Systems, Kashiwa, Chiba, Japan.

Publication information

PLoS One. 2019 Jun 4;14(6):e0217852. doi: 10.1371/journal.pone.0217852. eCollection 2019.

Abstract

In the life sciences, vast amounts of data are being generated, driven by the rapid growth of sequencing technology and the advancement of research. As the size of Resource Description Framework (RDF) datasets increases, efficient loading into triple stores becomes crucial. For example, UniProt's RDF version contains 44 billion triples as of December 2018, and PubChem provides an RDF dataset with 137 billion triples. As datasets become extremely large, loading them into a triple store is time-consuming. To improve the efficiency of this task, parallel loading is recommended for several stores. However, with parallel loading, dataset consistency must be considered if the dataset contains blank nodes. By definition, blank nodes do not have global identifiers; thus, pairs of identical blank nodes in the original dataset are recognized as different nodes if they end up in separate files after the dataset is split for parallel loading. To address this issue, we propose Split4Blank, a tool that splits a dataset into multiple files under the condition that identical blank nodes are not separated. The proposed tool uses connected-component and multiprocessor-scheduling algorithms to satisfy this condition. Furthermore, to confirm the effectiveness of the proposed approach, we applied Split4Blank to two life sciences RDF datasets. In addition, we generated synthetic RDF datasets to evaluate scalability based on the properties of various graphs, such as scale-free and random graphs.
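The approach described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' Split4Blank implementation: triples that share a blank node (directly or transitively) are grouped into connected components with a union-find structure, and the components are then assigned to k output files with a longest-processing-time-first greedy heuristic, a simple form of multiprocessor scheduling. Blank-node labels are assumed to use the conventional `_:` prefix.

```python
# Hypothetical sketch of splitting RDF triples into k files such that
# identical blank nodes never end up in different files.
from collections import defaultdict

def find(parent, x):
    # Union-find root lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def split_triples(triples, k):
    """triples: list of (s, p, o) tuples; blank nodes start with '_:'.
    Returns k lists of triples; triples sharing a blank node stay together."""
    parent = {}
    blank_of_triple = []
    for s, p, o in triples:
        blanks = [t for t in (s, o) if t.startswith("_:")]
        for b in blanks:
            parent.setdefault(b, b)
        # A triple with two blank nodes connects their components.
        if len(blanks) == 2:
            union(parent, blanks[0], blanks[1])
        blank_of_triple.append(blanks[0] if blanks else None)
    # Group triples by the root of their blank-node component;
    # triples without blank nodes can go into any file.
    groups = defaultdict(list)
    free = []
    for triple, b in zip(triples, blank_of_triple):
        if b is None:
            free.append(triple)
        else:
            groups[find(parent, b)].append(triple)
    # Greedy LPT scheduling: largest component first, into the
    # currently least-loaded file, to balance file sizes.
    files = [[] for _ in range(k)]
    for group in sorted(groups.values(), key=len, reverse=True):
        min(files, key=len).extend(group)
    for triple in free:
        min(files, key=len).append(triple)
    return files
```

Because component sizes are uneven, the greedy assignment only approximates a balanced split; one very large blank-node component sets a lower bound on the size of the largest output file, regardless of k.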


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb84/6548388/34998878b7fc/pone.0217852.g001.jpg
