Varsos Constantinos, Patkos Theodore, Oulas Anastasis, Pavloudi Christina, Gougousis Alexandros, Ijaz Umer Zeeshan, Filiopoulou Irene, Pattakos Nikolaos, Vanden Berghe Edward, Fernández-Guerra Antonio, Faulwetter Sarah, Chatzinikolaou Eva, Pafilis Evangelos, Bekiari Chryssoula, Doerr Martin, Arvanitidis Christos
Institute of Computer Science, Foundation of Research and Technology Hellas, Heraklion, Greece.
Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Crete, Greece.
Biodivers Data J. 2016 Nov 1;(4):e8357. doi: 10.3897/BDJ.4.e8357. eCollection 2016.
Parallel data manipulation using R has previously been addressed by members of the R community; however, most of these studies produce solutions that are not readily available to the average R user. Our targeted users, ranging from expert ecologists and microbiologists to computational biologists, often have difficulty finding optimal ways to exploit the full capacity of their computational resources. In addition, improving the performance of commonly used R scripts becomes increasingly difficult, especially with large datasets. The implementations described here may also be of significant interest to expert bioinformaticians and R developers. Our goals can therefore be summarized as: (i) description of a complete methodology for the analysis of large datasets by combining the capabilities of diverse R packages, and (ii) presentation of their application through a virtual R laboratory (RvLab) that makes the execution of complex functions and the visualization of results easy and readily available to the end user.
The novelty of this paper stems from parallel implementations that process data at different levels of abstraction, and from making these processes available through an integrated portal. Parallel R packages, such as pbdMPI (Programming with Big Data - Interface to MPI), are used to implement Single Program Multiple Data (SPMD) parallelization of primitive mathematical operations, allowing for interplay with existing package functions. Further R packages are integrated to offer connections to data-frame-like objects (databases) as secondary storage whenever memory demands exceed the available RAM. The RvLab runs on a PC cluster, using R version 3.1.2 (2014-10-31) on an x86_64-pc-linux-gnu (64-bit) platform, and offers an intuitive virtual environment interface that enables users to analyse ecological and microbial communities with optimized functions. A beta version of the RvLab is available after registration at: https://portal.lifewatchgreece.eu/.
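The SPMD idea mentioned above can be illustrated with a minimal, language-neutral sketch. It is written in Python rather than R and does not use MPI or any of the packages named in the abstract: the "ranks" are simulated one after another, and the abundance table and all function names (`block_bounds`, `local_column_sums`, `allreduce_sum`, `spmd_column_sums`) are hypothetical. It only shows the pattern of one program run by every rank on its own block of rows, followed by a sum reduction; it is not the authors' implementation.

```python
# Sketch of the SPMD pattern: every "rank" runs the same function on its own
# block of rows, and the partial results are combined by a sum reduction.
# Ranks are simulated sequentially here; no MPI library is involved.

def block_bounds(n_rows, size, rank):
    """Return the [start, stop) row range owned by `rank` out of `size` ranks."""
    base, extra = divmod(n_rows, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def local_column_sums(matrix, size, rank):
    """The single program: each rank sums the columns of its own row block."""
    start, stop = block_bounds(len(matrix), size, rank)
    sums = [0] * len(matrix[0])
    for row in matrix[start:stop]:
        for j, value in enumerate(row):
            sums[j] += value
    return sums


def allreduce_sum(partials):
    """Stand-in for an MPI allreduce: element-wise sum of all partial results."""
    return [sum(values) for values in zip(*partials)]


def spmd_column_sums(matrix, size):
    """Run the same local program for every simulated rank, then reduce."""
    partials = [local_column_sums(matrix, size, rank) for rank in range(size)]
    return allreduce_sum(partials)


# Hypothetical abundance table: rows are samples, columns are taxa.
abundance = [
    [3, 0, 5],
    [1, 2, 0],
    [0, 4, 4],
    [2, 2, 1],
    [6, 0, 0],
]
totals = spmd_column_sums(abundance, size=3)
print(totals)
```

In a real MPI setting each rank would be a separate process holding only its own block, so the memory needed per process shrinks as the number of ranks grows; the reduction step is the only point where ranks exchange data.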