Watts Nicholas A, Feltus Frank A
Clemson Computing & Information Technology.
Clemson University Department of Genetics & Biochemistry, Clemson, SC 29634, USA.
Bioinformatics. 2017 Feb 15;33(4):627-628. doi: 10.1093/bioinformatics/btw679.
Centralizing and storing data for long periods on an end user's computational resources is increasingly difficult in many scientific disciplines. For example, genomics data is increasingly large and distributed, and it must be moved to workflow execution sites ranging from lab workstations to the cloud. However, the typical user is not always informed about emerging network technology or the most efficient methods to move and share data. As a result, the user defaults to inefficient methods for transfer across the commercial internet.
To accelerate large data transfer, we created a tool called the Big Data Smart Socket (BDSS) that abstracts data transfer methodology from the user. The user provides BDSS with a manifest of datasets stored in a remote storage repository. BDSS then queries a metadata repository for curated data transfer mechanisms and the optimal path to move each file in the manifest to the site of workflow execution. BDSS functions as a standalone tool or can be integrated directly into a computational workflow, such as those provided by the Galaxy Project. To demonstrate applicability, we use BDSS within a biological context, although it is applicable to any scientific domain.
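The manifest-to-plan lookup described above can be illustrated with a minimal Python sketch. This is not the BDSS implementation or its API: the `TransferOption` schema, the in-memory `METADATA` table standing in for the metadata repository, the tool names, the example URLs, and the functions `plan_transfer` and `plan_manifest` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class TransferOption:
    """One curated way to fetch files from a data source (hypothetical schema)."""
    tool: str       # name of a transfer client (illustrative only)
    base_url: str   # endpoint that serves the same files
    priority: int   # lower value means preferred


# Hypothetical stand-in for the metadata repository: maps a URL prefix
# to the curated transfer options known for that data source.
METADATA: Dict[str, List[TransferOption]] = {
    "http://example.org/genomes/": [
        TransferOption("curl", "http://example.org/genomes/", 2),
        TransferOption("fast-client", "https://mirror.example.org/genomes/", 1),
    ],
}


def plan_transfer(
    url: str, metadata: Dict[str, List[TransferOption]]
) -> Optional[Tuple[str, str]]:
    """Return (tool, resolved_url) for the preferred option, or None if unknown."""
    for prefix, options in metadata.items():
        if url.startswith(prefix) and options:
            best = min(options, key=lambda o: o.priority)
            # Keep the file's relative path and swap in the preferred endpoint.
            return best.tool, best.base_url + url[len(prefix):]
    return None


def plan_manifest(
    manifest: List[str],
    metadata: Dict[str, List[TransferOption]],
    fallback_tool: str = "curl",
) -> List[Tuple[str, str, str]]:
    """Map each manifest URL to (original_url, tool, resolved_url)."""
    plans = []
    for url in manifest:
        choice = plan_transfer(url, metadata)
        if choice is None:
            # No curated entry: fall back to fetching the original URL as-is.
            plans.append((url, fallback_tool, url))
        else:
            plans.append((url, choice[0], choice[1]))
    return plans


if __name__ == "__main__":
    manifest = [
        "http://example.org/genomes/a.fastq",
        "http://other.example.net/b.bam",
    ]
    for original, tool, resolved in plan_manifest(manifest, METADATA):
        print(f"{original} -> {tool} {resolved}")
```

In this sketch the user supplies only the manifest of URLs; the choice of tool and endpoint comes from the lookup table, which mirrors the idea of hiding the transfer methodology from the user. Files with no curated entry fall back to a default download of the original URL, an assumption made here for completeness.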
BDSS is available under version 2 of the GNU General Public License at https://github.com/feltus/BDSS