Named Data Networking for Genomics Data Management and Integrated Workflows.

Authors

Ogle Cameron, Reddick David, McKnight Coleman, Biggs Tyler, Pauly Rini, Ficklin Stephen P, Feltus F Alex, Shannigrahi Susmit

Affiliations

School of Computing, Clemson University, Clemson, SC, United States.

Department of Computer Science, Tennessee Tech University, Cookeville, TN, United States.

Publication information

Front Big Data. 2021 Feb 15;4:582468. doi: 10.3389/fdata.2021.582468. eCollection 2021.

DOI:10.3389/fdata.2021.582468
PMID:33748749
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7968724/
Abstract

Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high-resolution biological data. The community is rapidly heading toward the petascale in single-investigator laboratory settings. As evidence, the NCBI SRA central DNA sequence repository alone contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous, as they are not only large in size but also stored in geographically distributed repositories such as the National Center for Biotechnology Information (NCBI), the DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI), and NASA's GeneLab. In this work, we first systematically point out the data-management challenges of the genomics community. We then introduce Named Data Networking (NDN), a novel but well-researched Internet architecture that is capable of solving these challenges at the network layer. NDN performs all operations, such as forwarding requests to data sources, content discovery, access, and retrieval, using content names (which are similar to traditional filenames or filepaths) and eliminates the need for a location layer (the IP address) for data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval using in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as creating federations of content repositories, retrieval from multiple sources, remote data subsetting, and others. Name-based operations also streamline the deployment and integration of workflows with various cloud platforms.
Our contributions in this work are as follows: 1) we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate; 2) we describe our efforts in applying NDN to a contemporary genomics workflow (GEMmaker) and quantify the improvements, with a preliminary evaluation showing a sixfold speedup in data insertion into the workflow; and 3) as a pilot, we have used an NDN naming scheme (agreed upon by the community and discussed in Section 4) to publish data from broadly used data repositories, including the NCBI SRA. We have loaded the NDN testbed with these pre-processed genomes, which can be accessed over NDN and used by anyone interested in those datasets. Finally, we discuss our continued effort in integrating NDN with cloud computing platforms, such as the Pacific Research Platform (PRP). The reader should note that the goal of this paper is to introduce NDN to the genomics community and discuss the properties of NDN that can benefit it. We do not present an extensive performance evaluation of NDN; we are working on extending and evaluating our pilot deployment and will present systematic results in future work.
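
The name-based retrieval and in-network caching described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' software and does not use any real NDN library; the class `ToyForwarder`, the helper `make_repo`, and the name `/example/genomics/sample-1/reads` are all hypothetical, and the name is not the community naming scheme the paper refers to. It assumes exact-name cache lookups and longest-prefix matching on name components, and it omits parts of a real NDN forwarder such as the pending interest table, data signatures, and cache eviction.

```python
from typing import Callable, Dict, Optional, Tuple

Name = Tuple[str, ...]
Producer = Callable[[Name], Optional[bytes]]


def parse_name(uri: str) -> Name:
    """Split a slash-delimited name such as '/a/b/c' into its components."""
    return tuple(part for part in uri.split("/") if part)


class ToyForwarder:
    """Toy NDN-style node: a content store (cache) plus a name-prefix table."""

    def __init__(self) -> None:
        self.content_store: Dict[Name, bytes] = {}  # in-network cache
        self.fib: Dict[Name, Producer] = {}         # name prefix -> data source
        self.upstream_fetches = 0                   # requests that left this node

    def register_prefix(self, prefix: str, producer: Producer) -> None:
        """Announce that `producer` can serve names under `prefix`."""
        self.fib[parse_name(prefix)] = producer

    def _longest_prefix_match(self, name: Name) -> Optional[Producer]:
        # Try the full name first, then progressively shorter prefixes.
        for length in range(len(name), -1, -1):
            producer = self.fib.get(name[:length])
            if producer is not None:
                return producer
        return None

    def express_interest(self, uri: str) -> Optional[bytes]:
        """Request data by name only; no host address is involved."""
        name = parse_name(uri)
        cached = self.content_store.get(name)
        if cached is not None:
            return cached  # served from the cache, no upstream traffic
        producer = self._longest_prefix_match(name)
        if producer is None:
            return None  # no known source for this name
        self.upstream_fetches += 1
        data = producer(name)
        if data is not None:
            self.content_store[name] = data  # cache for later requests
        return data


def make_repo(records: Dict[str, bytes]) -> Producer:
    """Build a hypothetical repository that answers requests by exact name."""
    store = {parse_name(uri): payload for uri, payload in records.items()}
    return lambda name: store.get(name)


node = ToyForwarder()
node.register_prefix(
    "/example/genomics",
    make_repo({"/example/genomics/sample-1/reads": b"ACGT"}),
)
first = node.express_interest("/example/genomics/sample-1/reads")
second = node.express_interest("/example/genomics/sample-1/reads")
```

In this sketch the second request for the same name is answered from the node's content store, so only one request reaches the repository. That is the mechanism the abstract credits with speeding up retrieval of popular datasets.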

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9067/7968724/1e40f0a667d6/fdata-04-582468-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9067/7968724/29c3eaf45905/fdata-04-582468-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9067/7968724/3ba925c1f4eb/fdata-04-582468-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9067/7968724/acf523194104/fdata-04-582468-g005.jpg