Named Data Networking for Genomics Data Management and Integrated Workflows.

Authors

Ogle Cameron, Reddick David, McKnight Coleman, Biggs Tyler, Pauly Rini, Ficklin Stephen P, Feltus F Alex, Shannigrahi Susmit

Affiliations

School of Computing, Clemson University, Clemson, SC, United States.

Department of Computer Science, Tennessee Tech University, Cookeville, TN, United States.

Publication information

Front Big Data. 2021 Feb 15;4:582468. doi: 10.3389/fdata.2021.582468. eCollection 2021.

Abstract

Advanced imaging and DNA sequencing technologies now enable the diverse biology community to routinely generate and analyze terabytes of high-resolution biological data. The community is rapidly heading toward the petascale in single-investigator laboratory settings. As evidence, the NCBI SRA central DNA sequence repository alone contains over 45 petabytes of biological data. Given the geometric growth of this and other genomics repositories, an exabyte of mineable biological data is imminent. The challenges of effectively utilizing these datasets are enormous, as they are not only large in size but also stored in geographically distributed repositories such as the National Center for Biotechnology Information (NCBI), the DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI), and NASA's GeneLab. In this work, we first systematically point out the data-management challenges of the genomics community. We then introduce Named Data Networking (NDN), a novel but well-researched Internet architecture that is capable of solving these challenges at the network layer. NDN performs all operations, such as forwarding requests to data sources, content discovery, access, and retrieval, using content names (which are similar to traditional filenames or filepaths) and eliminates the need for a location layer (the IP address) for data management. Utilizing NDN for genomics workflows simplifies data discovery, speeds up data retrieval through in-network caching of popular datasets, and allows the community to create infrastructure that supports operations such as creating federations of content repositories, retrieval from multiple sources, remote data subsetting, and others. Name-based operations also streamline the deployment and integration of workflows with various cloud platforms.
Our contributions in this work are as follows: 1) we enumerate the cyberinfrastructure challenges of the genomics community that NDN can alleviate; 2) we describe our efforts in applying NDN to a contemporary genomics workflow (GEMmaker) and quantify the improvements, with a preliminary evaluation showing a sixfold speedup in data insertion into the workflow; and 3) as a pilot, we have used an NDN naming scheme (agreed upon by the community and discussed in Section 4) to publish data from broadly used data repositories, including the NCBI SRA. We have loaded the NDN testbed with these pre-processed genomes, which can be accessed over NDN and used by anyone interested in those datasets. Finally, we discuss our continued effort in integrating NDN with cloud computing platforms, such as the Pacific Research Platform (PRP). The reader should note that the goal of this paper is to introduce NDN to the genomics community and discuss NDN's properties that can benefit the genomics community. We do not present an extensive performance evaluation of NDN; we are working on extending and evaluating our pilot deployment and will present systematic results in future work.
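
The idea of retrieving data by name, with popular datasets served from in-network caches, can be illustrated with a small toy model. The following is a minimal sketch and not the authors' implementation or any real NDN library: the `ToyNode` class, its methods, and the content name `/genomics/example/sample-1/reads` are all hypothetical and chosen only for illustration (the naming scheme actually used is the one discussed in Section 4 of the paper).

```python
from typing import Callable, Dict, Optional


class ToyNode:
    """Hypothetical toy forwarder: serves a name from its cache, else asks upstream."""

    def __init__(self, upstream: Optional[Callable[[str], Optional[bytes]]] = None):
        self.content_store: Dict[str, bytes] = {}  # in-network cache, keyed by name
        self.upstream = upstream                   # next hop toward a data source
        self.upstream_requests = 0                 # how often we had to go upstream

    def publish(self, name: str, data: bytes) -> None:
        """Make data available under a content name (no host address involved)."""
        self.content_store[name] = data

    def request(self, name: str) -> Optional[bytes]:
        """Return the data for a name, caching it on the way back."""
        if name in self.content_store:
            return self.content_store[name]
        if self.upstream is None:
            return None
        self.upstream_requests += 1
        data = self.upstream(name)
        if data is not None:
            self.content_store[name] = data
        return data


# A repository publishes a dataset under a hypothetical name.
repo = ToyNode()
repo.publish("/genomics/example/sample-1/reads", b"ACGT")

# A router between the consumer and the repository.
router = ToyNode(upstream=repo.request)

first = router.request("/genomics/example/sample-1/reads")   # fetched from repo
second = router.request("/genomics/example/sample-1/reads")  # served from cache
```

In this sketch the consumer never refers to where the data lives; the second request for the same name is answered by the router's cache, which is the behavior the abstract credits for faster retrieval of popular datasets.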


