Split4Blank: Maintaining consistency while improving efficiency of loading RDF data with blank nodes.

Affiliation

Database Center for Life Science (DBCLS), Research Organization of Information and Systems, Kashiwa, Chiba, Japan.

Publication information

PLoS One. 2019 Jun 4;14(6):e0217852. doi: 10.1371/journal.pone.0217852. eCollection 2019.

DOI:10.1371/journal.pone.0217852
PMID:31163073
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6548388/
Abstract

In the life sciences, the rapid growth of sequencing technology and the advancement of research are generating vast amounts of data. As Resource Description Framework (RDF) datasets grow, loading them efficiently into triple stores becomes crucial. For example, UniProt's RDF version contains 44 billion triples as of December 2018, and PubChem has an RDF dataset with 137 billion triples. At this scale, loading a dataset into a triple store takes considerable time. To improve the efficiency of this task, parallel loading has been recommended for several stores. However, with parallel loading, dataset consistency must be considered if the dataset contains blank nodes. By definition, blank nodes do not have global identifiers; thus, occurrences of the same blank node in the original dataset are recognized as different nodes if they end up in separate files after the dataset is split for parallel loading. To address this issue, we propose the Split4Blank tool, which splits a dataset into multiple files under the condition that identical blank nodes are not separated. The proposed tool satisfies this condition by using connected component and multiprocessor scheduling algorithms. To confirm the effectiveness of the proposed approach, we applied Split4Blank to two life sciences RDF datasets. In addition, we generated synthetic RDF datasets to evaluate scalability based on the properties of various graphs, such as scale-free and random graphs.
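
The two algorithmic ingredients named in the abstract can be illustrated with a minimal Python sketch. This is not the Split4Blank implementation: the function name `split_triples`, the representation of triples as tuples of N-Triples-style strings, the use of union-find for connected components, and the choice of the greedy longest-processing-time (LPT) heuristic for the scheduling step are all assumptions made for illustration. Triples that share a blank node, directly or through a chain of blank nodes, are grouped into one component, and whole components are then assigned to the currently lightest output file.

```python
import heapq
from collections import defaultdict


def is_blank(term):
    # Assumption: terms are N-Triples-style strings, so blank nodes start with "_:".
    return term.startswith("_:")


def split_triples(triples, k):
    """Partition triples into k bins so that no blank node spans two bins.

    Hypothetical helper for illustration only; not the tool's actual API.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Step 1: connected components. Two blank nodes that occur in the
    # same triple must end up in the same output file.
    for s, _, o in triples:
        blanks = [t for t in (s, o) if is_blank(t)]
        for b in blanks:
            find(b)
        if len(blanks) == 2:
            union(blanks[0], blanks[1])

    # Group triples by component; a triple without blank nodes forms
    # its own single-triple group and can go anywhere.
    groups = defaultdict(list)
    for i, triple in enumerate(triples):
        blanks = [t for t in (triple[0], triple[2]) if is_blank(t)]
        key = ("bnode", find(blanks[0])) if blanks else ("free", i)
        groups[key].append(triple)

    # Step 2: multiprocessor scheduling with the greedy LPT heuristic:
    # place the largest remaining group into the lightest bin.
    heap = [(0, i) for i in range(k)]
    heapq.heapify(heap)
    bins = [[] for _ in range(k)]
    for group in sorted(groups.values(), key=len, reverse=True):
        load, i = heapq.heappop(heap)
        bins[i].extend(group)
        heapq.heappush(heap, (load + len(group), i))
    return bins
```

Each returned bin could then be written to its own file and loaded in parallel. LPT is only a heuristic for balancing file sizes; a single very large component still bounds how evenly the data can be split.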

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb84/6548388/34998878b7fc/pone.0217852.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb84/6548388/f88248f3ab7c/pone.0217852.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb84/6548388/39b0652d80c3/pone.0217852.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb84/6548388/5323d492d518/pone.0217852.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb84/6548388/4523eb62e21e/pone.0217852.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb84/6548388/a11a338aec8b/pone.0217852.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb84/6548388/bd64289774b2/pone.0217852.g007.jpg