Grupo de Arquitectura de Computadores, Universidade da Coruña, A Coruña, Spain.
PLoS One. 2018 Apr 2;13(4):e0194361. doi: 10.1371/journal.pone.0194361. eCollection 2018.
Biclustering techniques are gaining attention in the analysis of large-scale datasets because they identify two-dimensional submatrices in which both rows and columns are correlated. In this work we present ParBiBit, a parallel tool that accelerates the search for interesting biclusters in binary datasets, which are very popular in fields such as genetics, marketing and text mining. It is based on the state-of-the-art sequential Java tool BiBit, which several studies have shown to be accurate, especially in scenarios that result in many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient C++11 implementation that supports both threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which consist of several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. The C++ and MPI source code, which runs on Linux systems, and a reference manual are available at https://sourceforge.net/projects/parbibit/.
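As an illustration of what "grouping the binary information into patterns" can look like, the following is a minimal sequential sketch in C++11. It is not taken from the ParBiBit or BiBit sources. It assumes each row fits in a single 64-bit word, that a candidate pattern is the bitwise AND of a pair of rows, and that a bicluster is the set of rows containing that pattern. The names `Bicluster`, `countBits`, `findBiclusters`, `minRows` and `minCols` are hypothetical, and threads and MPI are deliberately left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical result type for this sketch (not ParBiBit's own data structure).
struct Bicluster {
    std::uint64_t pattern;          // bitmask of the columns shared by all rows
    std::vector<std::size_t> rows;  // indices of the rows that contain the pattern
};

// Portable bit count; C++11 has no std::popcount.
inline int countBits(std::uint64_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// Assumption: every row of the binary matrix is encoded in one 64-bit word,
// so this sketch only handles datasets with at most 64 columns.
std::vector<Bicluster> findBiclusters(const std::vector<std::uint64_t>& rows,
                                      std::size_t minRows, int minCols) {
    std::vector<Bicluster> result;
    std::unordered_set<std::uint64_t> seen;  // patterns already processed

    for (std::size_t i = 0; i < rows.size(); ++i) {
        for (std::size_t j = i + 1; j < rows.size(); ++j) {
            // Candidate pattern: columns set to 1 in both rows of the pair.
            const std::uint64_t pattern = rows[i] & rows[j];
            if (countBits(pattern) < minCols) continue;
            if (!seen.insert(pattern).second) continue;  // duplicate pattern

            // Collect every row that contains all columns of the pattern.
            Bicluster bc;
            bc.pattern = pattern;
            for (std::size_t r = 0; r < rows.size(); ++r) {
                if ((rows[r] & pattern) == pattern) bc.rows.push_back(r);
            }
            if (bc.rows.size() >= minRows) result.push_back(bc);
        }
    }
    return result;
}
```

For example, with the rows `0xE`, `0x6`, `0x7` and `0x8` and both thresholds set to 2, the only pattern that survives is `0x6`, shared by the first three rows. In a sketch like this, the iterations of the outer loop over row pairs are independent apart from the duplicate check, which is what makes this kind of search a plausible candidate for distribution across threads and processes.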