Feltus Frank A, Breen Joseph R, Deng Juan, Izard Ryan S, Konger Christopher A, Ligon Walter B, Preuss Don, Wang Kuang-Ching
Department of Genetics and Biochemistry, Clemson University, Clemson, SC, USA.
University of Utah Center for High Performance Computing, Salt Lake City, UT, USA.
Bioinform Biol Insights. 2015 Sep 23;9(Suppl 1):9-19. doi: 10.4137/BBI.S28988. eCollection 2015.
In the last decade, high-throughput DNA sequencing has become a disruptive technology and pushed the life sciences into a distributed ecosystem of sequence data producers and consumers. Given the power of genomics and declining sequencing costs, biology is an emerging "Big Data" discipline that will soon enter the exabyte data range when all subdisciplines are combined. These datasets must be transferred across commercial and research networks in creative ways, since sending data without planning can seriously lengthen data processing time frames. Thus, it is imperative that biologists, bioinformaticians, and information technology engineers recalibrate data processing paradigms to fit this emerging reality. This review attempts to provide a snapshot of Big Data transfer across networks, a topic that many biologists overlook. Specifically, we discuss four key areas: 1) data transfer networks, protocols, and applications; 2) data transfer security, including encryption, access, firewalls, and the Science DMZ; 3) data flow control with software-defined networking; and 4) data storage, staging, archiving, and access. A primary intention of this article is to orient biologists to key aspects of the data transfer process so that they can frame their genomics-oriented needs for enterprise IT professionals.