Institute of Neural Information Processing, University of Ulm, 89069 Ulm, Germany.
BMC Bioinformatics. 2010 Apr 6;11:169. doi: 10.1186/1471-2105-11-169.
In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase further. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches to parallelizing algorithms rely largely on network communication protocols and therefore require multiple connected computers. One answer to this problem is to use the intrinsic capabilities of current multi-core hardware to distribute the tasks among the cores of a single computer.
We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms, based on the design principles of transactional memory, for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis, which employ repeated runs with slightly changed parameters. For large data sets, the computation speed of our Java-based algorithm increased by a factor of 10, while computational accuracy was preserved, compared to single-core implementations and a recently published network-based parallelization.
Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, with modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity analysis and cluster number estimation on the laboratory computer.
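To illustrate why k-means lends itself to multi-core execution, the following is a minimal sketch in Java. It is not the implementation described here: it does not use transactional memory, and it substitutes standard Java parallel streams as a simple shared-memory stand-in. The class name `ParallelKMeansSketch`, its method names, and the toy data are all hypothetical. The sketch relies on the fact that, within one iteration, each point can be assigned to its nearest centroid independently of all other points, and each centroid can be recomputed independently of the others.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelKMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Assignment step: every point is processed independently, so the loop
    // over points is spread across cores. Each task writes only labels[i].
    static int[] assign(double[][] data, double[][] centroids) {
        int[] labels = new int[data.length];
        IntStream.range(0, data.length).parallel().forEach(i -> {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double d = squaredDistance(data[i], centroids[c]);
                if (d < bestDist) {
                    bestDist = d;
                    best = c;
                }
            }
            labels[i] = best;
        });
        return labels;
    }

    // Update step: each centroid is recomputed independently as the mean of
    // its members. An empty cluster keeps its previous centroid.
    static double[][] update(double[][] data, int[] labels, double[][] old) {
        int k = old.length;
        int dim = old[0].length;
        double[][] next = new double[k][];
        IntStream.range(0, k).parallel().forEach(c -> {
            double[] sum = new double[dim];
            int count = 0;
            for (int i = 0; i < data.length; i++) {
                if (labels[i] == c) {
                    for (int j = 0; j < dim; j++) {
                        sum[j] += data[i][j];
                    }
                    count++;
                }
            }
            if (count == 0) {
                next[c] = old[c].clone();
            } else {
                for (int j = 0; j < dim; j++) {
                    sum[j] /= count;
                }
                next[c] = sum;
            }
        });
        return next;
    }

    // Alternates update and assignment until labels stop changing
    // or maxIter iterations have been run.
    static int[] cluster(double[][] data, double[][] init, int maxIter) {
        double[][] centroids = init;
        int[] labels = assign(data, centroids);
        for (int it = 0; it < maxIter; it++) {
            centroids = update(data, labels, centroids);
            int[] next = assign(data, centroids);
            if (Arrays.equals(next, labels)) {
                break;
            }
            labels = next;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Toy data: two well-separated groups of two points each.
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] init = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(data, init, 100)));
        // prints [0, 0, 1, 1]
    }
}
```

Because each parallel task writes to its own array slot, the sketch needs no locks. Repeated runs with varied parameters, as used in stability and sensitivity analysis, could be distributed across cores in the same way, under the same assumptions.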